Chapter I

Discretization of Rational Data / Jonathan Mugan and Klaus Truemper 1 

Frequently, one wants to extend the use of a classification method that, in principle, requires records with 
True/False values, so that records with rational numbers can be processed. In such cases, the rational 
numbers must first be replaced by True/False values before the method may be applied. In other cases, a 
classification method, in principle, can process records with rational numbers directly, but replacement 
by True/False values improves the performance of the method. The replacement process is usually called 
discretization or binarization. This chapter describes a recursive discretization process called Cutpoint. 
The key step of Cutpoint detects points where classification patterns change abruptly. The chapter includes 
computational results where Cutpoint is compared with entropy-based methods that, to date, have been 
found to be the best discretization schemes. The results indicate that Cutpoint is preferred by certain 
classification schemes, while entropy-based methods are better for other classification methods. Thus, 
one may view Cutpoint as an additional discretization tool worth considering.

Chapter II 

Vector DNF for Datasets Classifications: Application to the Financial Timing Decision Problem / 
Massimo Liquori and Andrea Scozzari 24 

Traditional classification approaches consider a dataset formed by an archive of observations classified 
as positive or negative according to a binary classification rule. In this chapter, we consider the financial 
timing decision problem, which is the problem of deciding the time when it is profitable for the investor to 
buy shares or to sell shares or to wait in the stock exchange market. The decision is based on classifying 
a dataset of observations, represented by a vector containing the values of some financial numerical attributes,
according to a ternary classification rule. We propose a new technique based on partially defined
vector Boolean functions. We test our technique on different time series of the Mibtel stock exchange
market in Italy, and we show that it provides a high classification accuracy, as well as wide applicability 
for other classification problems where a classification in three or more classes is needed. 

Chapter III 

Reducing a Class of Machine Learning Algorithms to Logical Commonsense 

Reasoning Operations / Xenia Naidenova 41 

The purpose of this chapter is to demonstrate the possibility of transforming a large class of machine-learning
algorithms into commonsense reasoning processes based on using well-known deduction and
induction logical rules. The concept of a good classification (diagnostic) test for a given set of positive 
examples lies in the basis of our approach to the machine-learning problems. The task of inferring all 
good diagnostic tests is formulated as searching the best approximations of a given classification (a 
partitioning) on a given set of examples. Lattice theory is used as a mathematical language for constructing
good classification tests. The algorithms for inferring good tests are decomposed into subtasks
and operations that are in accordance with main human commonsense reasoning rules. 

Chapter IV 

The Analysis of Service Quality Through Stated Preference Models and 

Rule-Based Classification / Giovanni Felici and Valerio Gatta 65 

The analysis of quality of services is an important issue for the planning and the management of many 
businesses. The ability to address the demands and the relevant needs of the customers of a given service 
is crucial to determine its success in a competitive environment. Many quantitative tools in the areas 
of statistics and mathematical modeling have been designed and applied to serve this purpose. Here we 
consider an application of a well-established statistical technique, the stated preference models (SP), 
to identify, from a sample of customers, significant weights to attribute to different aspects of the service
provided; such aspects may additively compose an overall satisfaction index. In addition, such a
weighting system is applied to a larger set of customers, and a comparison is made between the overall 
satisfaction identified by the SP index and the overall satisfaction directly declared by the customers. 
Such comparison is performed by two rule-based classification systems, decision trees and the logic 
data miner Lsquare. The results of these two tools help in identifying the differences between the two 
measurements from the structural point of view, and provide an improved interpretation of the results. 
The application considered is related to the customers of a large Italian airport. 

Chapter V 

Support Vector Machines for Business Applications / Brian C. Lovell and Christian J. Walder 82 

This chapter discusses the use of support vector machines (SVM) for business applications. It provides 
a brief historical background on inductive learning and pattern recognition, and then an intuitive motivation
for SVM methods. The method is compared to other approaches, and the tools and background
theory required to successfully apply SVM to business applications are introduced. The authors hope 
that the chapter will help practitioners to understand when the SVM should be the method of choice, as 
well as how to achieve good results in minimal time. 


Chapter VI 

Kernel Width Selection for SVM Classification: A Meta-Learning /
Shawkat Ali and Kate A. Smith 101


The most critical component of kernel-based learning algorithms is the choice of an appropriate kernel 
and its optimal parameters. In this chapter, we propose a rule-based metalearning approach for automatic
selection of the radial basis function (RBF) kernel and its parameters for support vector machine (SVM)
classification. First, the best parameter selection is considered on the basis of prior information about the data,
with the help of maximum likelihood (ML) method and Nelder-Mead (N-M) simplex method. Then the 
new rule-based metalearning approach is constructed and tested on different sizes of 112 datasets with 
binary class, as well as multiclass classification problems. We observe that our rule-based methodology 
provides significant improvement of computational time, as well as accuracy in some specific cases. 

Chapter VII 

Protein Folding Classification Through Multicategory Discrete SVM / 

Carlotta Orsenigo and Carlo Vercellis 116 

In the context of biolife science, predicting the folding structure of a protein plays an important role 
for investigating its function and discovering new drugs. Protein folding recognition can be naturally 
cast in the form of a multicategory classification problem, which appears challenging due to the high 
number of fold classes. Thus, in the last decade, several supervised learning methods have been applied
in order to discriminate between proteins characterized by different folds. Recently, discrete support 
vector machines have been introduced as an effective alternative to traditional support vector machines. 
Discrete SVM have been shown to outperform other competing classification techniques both on binary 
and multicategory benchmark datasets. In this chapter, we adopt discrete SVM for protein folding classification.
Computational tests performed on benchmark datasets empirically support the effectiveness
of discrete SVM, which are able to achieve the highest prediction accuracy. 


Chapter VIII 

Hierarchical Profiling, Scoring, and Applications in Bioinformatics / Li Liao 130 

Recently, clustering and classification methods have seen many applications in bioinformatics. Some 
are simply straightforward applications of existing techniques, but most have been adapted to cope 
with peculiar features of the biological data. Many biological data take the form of vectors, whose components
correspond to attributes characterizing the biological entities being studied. Comparing these
vectors, a.k.a. profiles, is a crucial step for most clustering and classification methods. We review the 
recent developments related to hierarchical profiling where the attributes are not independent, but rather 
are correlated in a hierarchy. Hierarchical profiling arises in a wide range of bioinformatics problems, 
including protein homology detection, protein family classification, and metabolic pathway clustering. 
We discuss in detail several clustering and classification methods where hierarchical correlations are 
tackled in effective and efficient ways by incorporating domain-specific knowledge. Relations to
other statistical learning methods and more potential applications are also discussed. 

Chapter IX

Hierarchical Clustering Using Evolutionary Algorithms / Monica Chiş 146


Clustering is an important technique used in discovering some inherent structure present in data. The 
purpose of cluster analysis is to partition a given data set into a number of groups such that objects in a 
particular cluster are more similar to each other than objects in different clusters. Hierarchical clustering 
refers to the formation of a recursive clustering of the data points: a partition into many clusters, each of 
which is itself hierarchically clustered. Hierarchical structures solve many problems in a large area of 
interests. In this chapter, a new evolutionary algorithm for detecting the hierarchical structure of an input 
data set is proposed. The method could be very useful in economy, market segmentation, management, 
biology taxonomy, and other domains. A new linear representation of the cluster structure within the 
data set is proposed. An evolutionary algorithm evolves a population of clustering hierarchies. The proposed
algorithm uses mutation and crossover as (search) variation operators. The final goal is to present a data
clustering representation to quickly find a hierarchical clustering structure. 

Chapter X 

Exploratory Time Series Data Mining by Genetic Clustering / T. Warren Liao 157 

In this chapter, we present genetic-algorithm (GA)-based methods developed for clustering univariate 
time series with equal or unequal length as an exploratory step of data mining. These methods basically 
implement the k-medoids algorithm. Each chromosome encodes, in binary, the data objects serving as the 
k-medoids. To compare their performance, both fixed-parameter and adaptive GAs were used. We first 
employed the synthetic control chart data set to investigate the performance of three fitness functions, 
two distance measures, and other GA parameters such as population size, crossover rate, and mutation 
rate. Two more sets of time series, with and without a known number of clusters, were also used in experiments:
one is the cylinder-bell-funnel data and the other is the novel battle simulation data. The clustering 
results are presented and discussed. 
outcome. This hybrid method integrates multiple rule sets generated by a data-mining algorithm with
the fitness function of a GA. The solutions of the GA represent intersections among rules providing tight 
parameter bounds. The integration of intuitive rules provides an explanation for each generated control 
setting, and it provides insights into the decision-making process. The ability to analyze parameter trends 
and the feasible solutions generated by the GA with respect to the outcomes is another benefit of the 
proposed hybrid method. The presented approach for deriving control signatures is applicable to various 
domains, such as energy, medical protocols, manufacturing, airline operations, customer service, and so 
on. Control signatures were developed and tested for control of a power-plant boiler. These signatures
discovered insightful relationships among parameters. The results and benefits of the proposed method
for the power-plant boiler are discussed in the chapter.
Bayesian Belief Networks for Data Cleaning / Enrico Fagiuoli, Sara Omerino,
and Fabio Stella 204

The importance of data cleaning and data quality is becoming increasingly clear, as evidenced by the 
surge in software, tools, consulting companies, and seminars addressing data quality issues. In this 
contribution, the authors present and describe how Bayesian computational techniques can be exploited 
for data-cleaning purposes to the extent of reducing the time to clean and understand the data. The proposed
approach relies on the computational device named Bayesian belief network, which is a general
statistical model that allows the efficient description and treatment of joint probability distributions. 
This work describes the conceptual framework that maps the Bayesian belief network computational 
device to some of the most difficult tasks in data cleaning, namely imputing missing values, completing
truncated datasets, and outlier detection. The proposed framework is described and supported by a
set of numerical experiments performed by exploiting the Bayesian belief network programming suite 
named HUGIN. 
Web page. The objective of this chapter is to show how Web clickstream data can be used to understand
the most likely paths of navigation in a Web site, with the aim of predicting, possibly online, which 
to understand, for instance, what is the probability of seeing a page of interest (such as the buying page
in an e-commerce site) coming from another page. Or what is the probability of entering (or exiting)
the Web site from any particular page. From a methodological viewpoint, we present two main research
Foreword


The importance of knowledge discovery and data mining is evident from the great plethora of books and
papers dedicated to this subject. Such methods are finding applications in almost any area of human endeavor.
This includes applications in engineering, science, business, medicine, and the humanities, just to name
a few. At the same time, however, there is a great confusion about the development and application of 
such methods. The main reason for this situation is that many, if not most, of the books examine issues 
on data mining in a narrow manner. Very few books study issues from the mathematical/algorithmic 
and also the applications point of view simultaneously. Even fewer books present a comprehensive 
view of all the critical issues involved with the development and application of such methods to many 
real-life domains. The present book, edited by two world-renowned scholars, Drs. Giovanni Felici and 
Carlo Vercellis, is a bright example of the most valuable books in this fast emerging field. The emphasis 
of this book on the mathematical aspects of knowledge discovery and data-mining methods makes the 
presentations scientifically sound and easy to understand in depth. 

The 19 chapters of this book have been written by a number of distinguished scholars, from all over 
the world, who discuss the most critical subjects in this area. The book starts by discussing an important 
first step for any application of such methods, that is, how to discretize the data. This step is essential 
as many methods use binary data, while real-life applications may be associated with nonbinary data. If 
the analyst is not careful at this step, then it is possible to end up with too many nonrelevant variables 
that generate computational problems associated with highly dimensional data. A related step is that of 
cleaning the data before they are used to extract the pertinent models. As before, if this step is not done 
properly, the validity of the final results may be in jeopardy. Another interesting topic discussed in this 
book is the development of sophisticated visualization techniques, which are presented in relation with 
many diverse domains, ranging from astronomy to genetics. The successful application of the proposed 
visualization techniques to these two highly demanding application areas attests to the high potential of
these methods for a wide spectrum of applications. The high volumes of log data, produced by recording
the way people surf the Web, provide an exciting opportunity, along with interesting algorithmic challenges,
for knowledge discovery and data-mining methods. This fascinating topic is also discussed here.
Another very interesting subject discussed is how to mine data that come from virtual environments 
that involve some kind of spatial navigation. Such studies involve the analysis of sequences of routes 
of actions. A highly promising direction of research seems to be based on the development of methods 
that attempt to combine characteristics of various approaches. Such methods are known as hybrids, and 
an interesting development of a new hybrid approach and its applications are presented as well. 

No book in this area would be complete without the discussion of logic-based methods that offer 
some unique algorithmic and application advantages. The relevant discussions are done by some of the 
most knowledgeable world-renowned scholars on this subject; logic methods are also applied to the 
analysis of financial data. The potential of using grids for distributed approaches and also parallelism is 
explained too. Another prominent topic is the use of clustering approaches, which are discussed by developing
some specialized evolutionary approaches. Clustering is also used in one of the most promising application areas 
for the future, the analysis of time series of data. Applications can be found in many domains as one realizes that 
systems or phenomena of interest usually generate data over time. In such settings, data from one point of time 
are somehow related to the data of the next point of time. Again, this very fascinating problem is discussed in 
great depth, and in an easy-to-understand manner by a distinguished expert in this fast-growing field. An extensive
treatment of the classification technique known as support vector machines is provided by three chapters
of the book; together with a complete treatment of the main theory of this method and of the related kernel
function theory, the new extension of discrete support vector machines is described and applied to bioinformatics.
This type of data is also the topic of other applications described in the book. The picture is completed with
the description of data-mining approaches for service quality measurement, of fuzzy set and rough set theory 
applied in different contexts, and of other industrial applications of data mining. 

It is quite clear that this book is very valuable to all practitioners and researchers working on different fields 
but unified by the need to analyze their voluminous and complex data. Therefore, it is strongly recommended to 
anyone who has an interest in data mining. Furthermore, it is hoped that others will follow the example of this 
book and present more studies that combine algorithmic developments and applications in the way this edited 
book by Drs. Felici and Vercellis does so successfully. 

Evangelos Triantaphyllou, PhD 
Professor 

Department of Computer Science 
Louisiana State University 
Baton Rouge, LA 70803 USA 
Preface


The idea of this book was conceived in June 2004, when a small group of researchers in the data-mining field 
gathered on the shores of Lake Como, in Italy, to attend a focused conference, MML (Mathematical Methods
for Learning) 2004, with the objective of fostering interaction among scholars from different countries and
with different scientific backgrounds, sharing their research interests in data mining and knowledge discovery. As 
one of the side effects of that meeting, the conference organizers took on the exciting task of editing high quality 
scientific publications, where the main contributions presented at the MML conference could find an appropriate 
place, one next to the other, as they fruitfully did within the conference sessions. Some of the papers presented in 
Como, sharing a focus on mathematical optimization methods for data mining, found their place in a special issue 
of the international journal Computational Optimization and Applications (COAP 38(1), 2007). Another large group of
papers constituted the most appropriate building blocks for an edited book that would span a vast area of data-mining
methods and applications, showing, on one hand, the relevance of mathematical methods and algorithms aimed
at extracting knowledge from data, and on the other hand, how wide the application domains of data mining are.
Shortly afterward, the project found interest and support from IGI Global, a dynamic publisher very active in promoting
research-oriented publications in technological and advanced fields of knowledge. We eventually managed to finalize
all the chapters and, moreover, enriched the book with additional research work that, although not presented at
the MML conference, appears to have strong relevance within the scope of the book. Most of the chapters have
evolved since they were presented in 2004, and authors had the opportunity to update their work with additional 
results until the beginning of 2007. 

The Motivations of Data Mining 

The interest in data mining of researchers and practitioners with different backgrounds has increased steadily year 
after year. This growth is due to several reasons. 

First, data mining today plays a fundamental role in analyzing and understanding the vast amount of information
collected by business, government, and scientific applications. The ability to analyze large bodies of data and
extract from them relevant knowledge has become a valuable service for most organizations that operate in the
highly globalized and competitive business arena. The technical skills required to operate and put to use data-mining
techniques are now appreciated, and often required, by the business intelligence units of financial institutions,
government agencies, telecommunication companies, service providers, retailers, and distribution operators. 

A second reason is to be found in the excellent and constantly improving quality of the methods and tools 
that are being developed in this field. Advanced mathematical models, state-of-the-art algorithmic techniques, 
and efficient data management systems, combined with a decreasing cost of computational power and computer 
memory, are now able to support data analysts with methodologies and tools that were not available a few years 
ago. Furthermore, such instruments are often available at low cost and with easy-to-use interfaces, integrated into 
well-established data management systems. 

A third reason that is not to be overlooked is connected with the role that data-mining methods are playing in 
providing support to basic research in many scientific areas. To mention an example, biology and genetics are 
currently enjoying the results of the application of advanced mining techniques that allow discovery of valuable 
facts in complex data gathered from experiments in vitro. 
Finally, we wish to mention the impulse to methodological research that has been given in many areas by
the open problems posed by data-mining applications. The learning and classification problems coming from 
real-life problems have been exploited through many mathematical theories under different formalizations, and 
theoretical results of unusual relevance have been reached in optimization theory, computer science, and statistics, 
also thanks to the many new and stimulating problems. 

Data Mining as a Practical Science 

Data mining is located at the crossing of different disciplines. Its roots are to be found in the data analysis 
techniques that were originally the main object of the study of statistics. The fundamental ideas at the basis of 
estimation theory, classification, clustering, and sampling theory are indeed still among the major ingredients of data
mining. But other methods and techniques have been added to the toolbox of the data analyst, extending the limits
of classical parametric statistics with more complex models, reaching their maturity with the current state of
knowledge on decision trees, neural networks, support vector machines, just to mention a few. In addition, the 
need to organize and manage large bodies of data has required the deployment of computer science techniques 
for database management, query optimization, optimal coding of algorithms, and other tasks devoted to the storing
of information in the memory of computers and to the efficient execution of algorithms.

A common trademark of the modern approaches is the formalization of estimation and classification problems
arising in data mining as mathematical optimization problems, and the use of consistent algorithmic techniques
to determine optimal solutions for these problems. Such a methodological framework has been strongly supported
by applied mathematics and operations research (OR), a scientific discipline characterized by a deep integration
of mathematical theory and practical problems. Significant evidence of the role of OR in data mining is the
contribution that nonlinear and integer optimization methods have given to the solution of the error minimization
functions that need to be optimized to train neural networks and support vector machines. Analogously,
integer programming and combinatorial optimization have been largely used to solve problems arising in the 
identification of synthetic rule-based classification models and in the selection of optimal subsets of features in 
large datasets. 

Despite its strong methodological characterization, data mining cannot be successfully applied without a deep 
understanding of the semantics of each specific problem, which often requires the customization of existing methods
or the development of ad hoc techniques, partially based on already existing algorithms. To some extent, the
real challenge that the data mining practitioner has to face is the selection, among many different methods and 
approaches, of the one that best serves the scope of the task considered, often assessing a compromise between 
the complexity of the chosen model and its generalization capability. 

The Contribution of this Edited Book 

This book aims to provide a rich collection of current research on a broad array of topics in data mining, ranging 
from recent theoretical advancements in the field to relevant applications in diverse domains. Future directions 
and trends in data mining are also identified in most chapters. 

Therefore, this volume should be an excellent guide to researchers, practitioners, and students. Its audience 
is represented by the research community; business executives and consultants; and senior students in the fields 
of data mining, information and knowledge creation, optimization, statistics, and computer science. 

A Guided Tour of the Chapters 

The book is composed of 19 chapters. Each one is authored by a different group of scientists, treats one of the 
many different theoretical or practical aspects of data mining, and is self contained with respect to the treated 
subject. 


The first four chapters deal, to different degrees, with data-mining problems in logic setting, where the main 
purpose is to extract rules in logic format from the available data. 

In particular, Chapter I is written by Jonathan Mugan and Klaus Truemper, and describes a sophisticated
and complete technique to transform a set of data represented in various formats by means of an extended set of 
logic variables. Such task, often referred to as discretization, or binarization, is a key step in the application of 
logic-based classification methods to data that is described by rational or nominal variables. The chapter extends 
the notion of rational variables with the definition of set variables, for example, variables that are represented 
by their membership functions to one or more sets. The method described is characterized by the fact that the 
set of logic variables extracted is compact, but strongly aimed at the task of classifying, with high precision, the 
available data with respect to a given binary target variable. The algorithm described
in the chapter has been implemented and integrated into the logic data mining software Lsquare, made available
by the authors as open source. 
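
As a rough illustration of what discretization means in this setting, the following Python sketch replaces a rational value by one True/False flag per cutpoint. It is not the method of Chapter I: the function name `binarize` and the cutpoint values are hypothetical, and the cutpoints are simply assumed to be given, whereas choosing them well is the problem the chapter addresses.

```python
def binarize(value, cutpoints):
    """Replace one rational value by True/False flags, one per cutpoint.

    Flag i is True when the value exceeds the i-th smallest cutpoint.
    The cutpoints are assumed to be known in advance (hypothetical here).
    """
    return [value > c for c in sorted(cutpoints)]


# Hypothetical attribute with two hand-picked cutpoints.
cutpoints = [2.5, 7.0]
records = [1.0, 3.2, 9.4]

# Each rational value becomes a short vector of logic variables.
encoded = [binarize(v, cutpoints) for v in records]
```

With two cutpoints, each value falls into one of three intervals, and the pair of flags identifies the interval.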

Chapter II is written by Massimo Liquori and Andrea Scozzari. Here the subject is the use of another well-known
logic data mining technique, the logical analysis of data (LAD), originally developed at Rutgers University
by the research team led by Peter Hammer. The authors propose an interesting use of this method to
treat logic classification where the target variable is of ternary nature (i.e., it can assume one of three possible 
values). Even more interesting is the application for which the method has been developed: the financial timing 
decision problem, namely the problem of deciding when to buy and when to sell a given stock to maximize 
the profit of the trading operations. The results presented in this chapter show how logic methods can make a
significant contribution to a field where classical statistics has always played the main role.

Chapter III, authored by Xenia Naidenova, brings to the readers’ attention several interesting theoretical 
aspects of logic deduction and induction that find relevant application in the construction of machine-learning 
algorithms. The chapter treats extensively the many details connected with this topic, and illustrates many
results with simple examples. The author adopts the lattice theory as the basic mathematical tool, and succeeds 
in proposing a sound integration of inductive and deductive reasoning for learning implicative logic rules. The 
results described are the basis for the implementation of an algorithm that efficiently infers good maximally 
redundant tests. 

In Chapter IV, Giovanni Felici and Valerio Gatta describe a study where the results of a stated preference 
model for measuring quality of service are combined with logic-based data mining to gain deeper insight into the
system of preferences expressed by the customers of a large airport. The data-mining methods considered are 
decision trees and the logic miner Lsquare. The results are presented in the form of a set of rules that enables 
one to understand the similarities and the differences in two different methods to compute a quality of service 
index. 

The topics of the following three chapters revolve around the concept of support vector machines (SVM), a
mathematical method for classification and regression that emerged in the last decade from statistical learning theory
and quickly attained remarkable results in many applications. SVM are based on optimization methods, particularly
in the field of nonlinear programming, and are a vivid example of the contributions that can be given
to data mining by state-of-the-art theoretical research in mathematical optimization. 

In Chapter V, Brian C. Lovell and Christian J. Walder provide a rich overview of SVM in the context of 
data mining for business applications. They describe, with high clarity, the basic steps in SVM theory, and then 
integrate the chapter with several practical considerations on the use of this class of methods, comparing it with 
other learning approaches in the context of real-life applications. 

An important role in SVM is played by kernel functions, which provide an implicit transformation of the 
representation of the original space of data into a high dimensional space of features. By means of such 
transformations, SVM can efficiently determine linear separations in the feature space that correspond to nonlinear 
separations in the original space. 

The identification of the right kernel function is the topic of Chapter VI, written by Shawkat Ali and Kate 
A. Smith, where they describe the application of a metalearning approach to optimally estimate the parameters 
that identify the kernel function before SVM is applied. The chapter highlights clearly the role of parameter 
estimation in the use of learning models, and discusses how the estimation procedure should be able to adapt to 
the specific dataset under analysis. The experimental analysis provides tests on both binary and multicategory 
classification problems. 

An interesting evolution of SVM is represented by discrete support vector machines, proposed in the last few 
years by Carlotta Orsenigo and Carlo Vercellis, authors of Chapter VII. According to statistical learning theory, 
discrete SVM directly face the minimization of the misclassification rate, within the risk functional, instead of 
replacing it with the misclassification distance as traditional SVM do. The problem is then modeled as a 
mixed-integer programming problem. The method, already successful in other applications, is extended and applied 
here to protein folding, a very challenging task in multicategory classification. The experiments performed by 
the authors on benchmark datasets show that the proposed method achieves the highest accuracy in comparison 
to other techniques. 

The use of data-mining methods to extract knowledge from large databases in genetic and biomedical ap- 
plications is increasing at a fast pace, and Chapter VIII, written by Li Liao, deals with this topic. Often the data 
in this context is based on vectors of extremely large dimensions, and specific techniques must be deployed to 
obtain successful results. Li Liao tackles several of the specific problems related with handling biomedical data, 
in particular those related with data described by attributes that are correlated with each other and are organized 
in a hierarchical structure. Clustering and classification methods that exploit the hierarchies in data are considered 
and compared with statistical learning methods. 

Chapters IX and X both deal with clustering, a fundamental problem in unsupervised learning. In Chapter 
IX, Monica Chiş discusses hierarchical clustering, where the clusters are obtained by recursively separating the 
data into groups of similar objects. The methods investigated belong to the family of genetic algorithms, where 
an initial population of chromosomes, corresponding to potential clusters, is evolved at each iteration, generating 
new chromosomes with the objective of minimizing a fitness function. The genetic operators adopted here are 
standard mutation and crossover. 

Evolving on these concepts, T. Warren Liao presents, in Chapter X, a method based on genetic algorithms 
to cluster univariate time series. The study of time series is indeed a very central topic in data analysis, and is 
often overlooked in standard data-mining applications, where the main attention is addressed to multivariate data. 
Time series, on the other hand, present several complex aspects linked to autocorrelation and lag parameters, 
and surely can benefit by the use of the new methods developed in the area of data mining. Using the method 
of the k-medoids, the author compares the performances of three fitness functions, two distance measures, and 
other parameters that characterize the genetic algorithms considered. The chapter presents several experiments 
on data derived from cylinder-bell-funnel data and battle simulation data. 

Chapter XI is a sound example of how advanced data-mining techniques can provide relevant information in 
production systems. Alex Burns, Shital Shah, and Andrew Kusiak describe the implementation of a method that 
integrates genetic algorithms and data mining. The results of a rule-based data-mining algorithm are evaluated 
and scored using a fitness function, and the related methods made available in the context of genetic algorithms. 
Here again we find a strong connection between data analysis and optimization techniques, and we see how certain 
decision problems can be successfully solved building ad hoc procedures, where methodologies and techniques 
from different backgrounds are deployed. The authors describe an application of the method to a power-plant 
boiler and highlight the contribution given to the production process. 

Chapter XII is written by Enrico Fagiuoli, Sara Omerino, and Fabio Stella. It is an interesting work that 
shows how complex models derived from classical statistical techniques can play an important role in the data 
treatment process. The chapter describes the use of Bayesian belief networks to perform data cleaning, a relevant 
problem in most data-mining applications where the information available is obtained with noisy, incomplete, 
or error-prone procedures. Here Bayesian belief networks are used to instantiate missing values of incomplete 
records, to complete truncated datasets, and to detect outliers. The effectiveness of the approach is supported 
by numerical experiments. 

Chapter XIII deals with a similar topic, data cleaning. Here, Chuck P. Lam and David G. Stork describe, in 
a complete and accurate way, the problem of labeling noise, requiring the identification and treatment of records 
when some of the labels attached to the records are different from the correct value due to some source of noise 
present in the data collection process. The chapter depicts the two main problems in cleaning labeling noise: 
the identification of noise and the consequent revision scheme, through removal, replacement, or escalation to 
human supervision. In particular, the authors examine the k-nearest neighbor method to solve the identification 
problem, while they use probabilistic arguments to evaluate the alternative revision schemes. The public domain 
UCI repository is used as a source of datasets where the proposed methods are tested. 

In Chapter XIV another tool originated in the statistical and stochastic processes environment is used to 
solve a relevant mining problem, clickstream analysis, which is attracting a growing attention. The problem is 
generated by the need to investigate the log files produced when users visit a Web site. These log files report the 
sequence of steps in navigation (clicks) made by the Web users. These applications can be useful for designing 
Web sites and for related business-oriented analysis. The authors of the chapter, Paolo Baldini and Paolo Giudici, 
propose to use Markov chain models to investigate the structure of the most likely navigation path in a Web site, 
with the objective of predicting the next step made by a Web user, based on the previous ones. 

Antonino Staiano, Lara De Vinco, Giuseppe Longo, and Roberto Tagliaferro are the authors of Chapter 
XV. Here the topic is the visualization of complex and multidimensional data for exploration and classification 
purposes. The method used is based on probabilistic principal surfaces: by means of a density function in the 
original space, data are projected into a reduced space defined by a set of latent variables. A special case arises 
when the number of latent variables is equal to three and the projected space is a spherical manifold, particularly 
indicated to represent sparse data. Besides visualization, such reduced spaces can be used to apply classification 
algorithms for efficiently determining surfaces that separate groups of data belonging to different classes. Ap- 
plications of the method for data in the astronomy and genetics domains are discussed. 

Chapter XVI presents an unconventional application of data-mining techniques that assists spatial naviga- 
tion in virtual environments. In this setting, users are able to navigate in a three-dimensional virtual space to 
accomplish a number of tasks. Such navigation may be difficult or inefficient for nonexperienced users, and the 
application discusses the use of data-mining techniques to extract knowledge from the navigation patterns of 
expert users, and create good navigation models. Such process is put into action by a navigation interface that 
implements a frequent wayfinding-sequence method. The authors, Mehmed Kantardzic, Pedram Sadeghian, and 
Walaa M. Sheta, have run experiments in simulated virtual environments, extensively discussed in the chapter. 

Data-mining algorithms can be very demanding from the point of view of computational requirements, 
such as speed and memory, especially when large datasets are analyzed. One possible solution to deal with the 
computational burden is the use of parallel and distributed computing. Such an issue is the topic of Chapter 
XVII, written by Antonio Congiusta, Domenico Talia, and Paolo Trunfio, on the use of grid computing for dis- 
tributed data mining. An integrated architecture that can properly host all the steps of the data analysis process 
(data management, data transfer, data mining, knowledge representation) has been designed and is presented in 
the chapter. The components of this data-mining-oriented middleware, termed knowledge grid, are described, 
explaining how these services can be accessed using the standard open grid architecture model. 

The last two chapters of the book are devoted to two knowledge extraction methods that have received large 
attention in the scientific community. They extend the limits of standard machine learning theory, and can be 
used to build data-mining applications able to deal with unconventional applications, and to provide information 
in an original format. Chapter XVIII is about the use of fuzzy logic. Here Nikos Pelekis, Babis Theodoulidis, 
Ioannis Kopanakis, and Yannis Theodoridis cover the design of a classification heuristic scheme based on fuzzy 
methods. The performances of the method are analyzed by means of extensive simulated experiments. The topic 
of Chapter XIX is the method of rough sets. This method presents several noticeable features that originally 
characterize the rules extracted from the data. The interest of this chapter, written by Yanbing Liu, Menghao 
Wang, and Jong Tang, is also due to the application of the method to analyze and evaluate network topologies 
in routing problems. The application of data mining techniques in network problems associated with telecom- 
munication problems is novel, and is likely to represent a relevant object of research in the future. 
ABSTRACT 

Frequently, one wants to extend the use of a classification method that, in principle, requires records with 
True/False values, so that records with rational numbers can be processed. In such cases, the rational 
numbers must first be replaced by True/False values before the method may be applied. In other cases, 
a classification method in principle can process records with rational numbers directly, but replacement 
by True/False values improves the performance of the method. The replacement process is usually called 
discretization or binarization. This chapter describes a recursive discretization process called Cutpoint. 
The key step of Cutpoint detects points where classification patterns change abruptly. The chapter includes 
computational results, where Cutpoint is compared with entropy-based methods that, to date, have been 
found to be the best discretization schemes. The results indicate that Cutpoint is preferred by certain 
classification schemes, while entropy-based methods are better for other classification methods. Thus, 
one may view Cutpoint to be an additional discretization tool that one may want to consider. 


INTRODUCTION 

One often desires to apply classification methods 
that, in principle, require records with True/False 
values to records that, besides True/False values, 
contain rational numbers. For ease of reference, 
we call rational number entries rational data 
and refer to True/False entries as logic data. In 
such situations, a discretization process must 
first convert the rational data to logic data. 
Discretization is also desirable in another setting. 
Here, a classification method in principle can 
process records with rational numbers directly, 
but its performance is improved when the 
rational data are first converted to logic data. 
This chapter describes a method called Cutpoint 
for the discretization task, and compares 
its effectiveness with that of entropy-based 
methods, which presently are considered to be 
the best discretization schemes. 

Define nominal data to be elements or subsets 
of a given finite set. In Bartnikowski et al. 
(Bartnikowski, Granberry, Mugan, & Truemper, 
2004), an earlier version of Cutpoint is described 
and used for the transformation of some cases 
of nominal data to logic data. Specifically, the 
nominal data are first converted to rational data, 
which are then transformed to logic data by 
Cutpoint. 

We focus here on the following case. We are 
given records of two training classes, A and 
B, that have been randomly selected from two 
populations 𝒜 and ℬ, respectively. We want to 
derive a classification scheme from the records 
of A and B. Later, that scheme is to be applied to 
records of 𝒜 - A and ℬ - B. 

For the purpose of a simplified discussion in 
this section, we assume for the moment that the 
records have no missing entries. That restriction 
is removed in the next section. 

Abrupt Pattern Changes and 
Cutpoint 

Generally, the discretization may be accomplished 
by the following, well-known approach. One 
defines, for a given attribute, k ≥ 1 breakpoints 
and encodes each rational number of the attribute 
by k True/False values, where the jth value is 
True if the rational number is greater than the jth 
breakpoint, and is False otherwise. The selection 
of the k breakpoints requires care if the records 
of 𝒜 - A and ℬ - B are to be classified with good 
accuracy. 
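
The encoding just described can be sketched in a few lines of Python. This is only an illustration: the function name `encode` and the breakpoint values are assumptions made for the example and are not taken from Cutpoint.

```python
def encode(value, breakpoints):
    """Encode one rational number as k True/False values.

    The jth value is True if `value` is greater than the jth
    breakpoint, and False otherwise.
    """
    return [value > b for b in breakpoints]


# Hypothetical breakpoints for a single attribute (illustrative only).
breakpoints = [12.0, 18.1, 27.0]
encoded = encode(16.7, breakpoints)
print(encoded)  # [True, False, False]
```

A value equal to a breakpoint is encoded as False for that breakpoint, since the comparison is strict.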

A number of techniques for the selection of the 
breakpoints have been proposed, and later in this 
section, we provide a review of those methods. 
Suffice it to say here that the most effective 
methods to date are based on the notion of 
entropy. In these methods, the breakpoints are 
so selected that the rational numbers of a given 
attribute can be most compactly classified by a 
decision tree as coming from A or B. In contrast, 
Cutpoint is based on a different goal. Recall that 
the records of the sets A and B are presumed to 
be random samples of the populations 𝒜 and 
ℬ. Taking a different viewpoint, we may view 
each record of 𝒜 - A and ℬ - B to be a random 
variation of some record of A or B, respectively. 
The goal is then to select the breakpoints so 
that these random variations largely leave the 
True/False values induced by the breakpoints 
unchanged. 

Cutpoint aims for the stated goal by selecting 
breakpoints, called markers, that correspond to 
certain abrupt changes in classification patterns, 
as follows. First, for a given attribute, the rational 
numbers are sorted. Second, each value is labeled 
as A or B, depending on whether the value comes 
from a record of A or B, respectively. For the 
sake of a simplified discussion, we ignore, for 
the moment, the case where a rational number 
occurs in both a record of A and a record of 
B. Third, each entry with label A (resp. B) is 
assigned a class value of 1 (resp. 0). Fourth, 
Gaussian convolution is applied to the sequence 
of class values, and the midpoint between two 
adjacent entries, where the smoothed class 
values change by the largest amount, is declared 
to be a marker. 

For example, if the original sorted sequence, 
with class membership in parentheses, is ..., 
10.5(A), 11.7(A), 15.0(A), 16.7(A), 19.5(B), 
15.2(B), 24.1(B), 30.8(B),..., then the sequence 
of class values is ..., 1, 1, 1, 1, 0, 0, 0, 0,.... Note 
the abrupt transition of the subsequence of 1s 
to the subsequence of 0s. When a Gaussian 
convolution with small standard deviation σ 
is performed on the sequence of class values, 
a sequence of smoothed values results, which 
exhibits a relatively large change at the point 
where the original sequence changes from 1s 
to 0s. If this is the largest change for the entire 
sequence of smoothed class values, then the 
original entries 16.7(A) and 19.5(B), which 
correspond to that change, produce a marker 
with value (16.7 + 19.5)/2 = 18.1. 
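
The four steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the Cutpoint implementation: the smoothing is applied over index positions in the sorted sequence, the kernel is renormalized near the ends of the sequence, and the standard deviation is simply fixed at 1.0, whereas Cutpoint selects it by the probabilistic argument given later. The names `gaussian_smooth` and `find_marker` are hypothetical, and the data are a shortened version of the example sequence.

```python
import math


def gaussian_smooth(values, sigma):
    """Smooth a sequence with a Gaussian kernel over index positions.

    Weights are renormalized for each position, so that a constant
    sequence stays constant near the ends.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
                   for j in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed


def find_marker(entries, sigma=1.0):
    """Return the midpoint between the two adjacent entries where the
    smoothed class values change by the largest amount.

    `entries` is a list of (value, label) pairs, label "A" or "B".
    """
    entries = sorted(entries)                        # step 1: sort
    class_values = [1.0 if label == "A" else 0.0     # steps 2 and 3
                    for _, label in entries]
    smoothed = gaussian_smooth(class_values, sigma)  # step 4
    changes = [abs(smoothed[i + 1] - smoothed[i])
               for i in range(len(smoothed) - 1)]
    i = max(range(len(changes)), key=changes.__getitem__)
    return (entries[i][0] + entries[i + 1][0]) / 2.0


entries = [(10.5, "A"), (11.7, "A"), (15.0, "A"), (16.7, "A"),
           (19.5, "B"), (24.1, "B"), (30.8, "B")]
marker = find_marker(entries)
print(marker)  # approximately 18.1
```

For this input, the largest smoothed change falls between 16.7(A) and 19.5(B), so the marker agrees with the value computed in the text.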
Evidently, a large change of the smoothed class 
values corresponds in the original sorted sequence 
of entries to a subsequence of rational numbers, 
mostly from A, followed by a subsequence of 
numbers, mostly from B, or vice versa. We call 
such a situation an abrupt pattern change. Thus, 
markers correspond to abrupt pattern changes. 
We differentiate between two types of abrupt 
pattern changes. We assume, reasonably, that an 
abrupt change, produced by all records of the 
populations 𝒜 and ℬ, signals an important change 
of behavior and thus should be used to define 
a True/False value. The records of the subsets 
A and B may exhibit portions of such pattern 
changes. We say that these pattern changes of 
the records of A and B are of the first kind. The 
records of A and B may also have additional 
abrupt pattern changes that do not correspond 
to abrupt pattern changes in the records of the 
populations 𝒜 and ℬ. This is particularly so if 
A and B are comparatively small subsets of the 
populations 𝒜 and ℬ, as is typically the case. 
We say that the latter pattern changes are of the 
second kind. 

There is another way to view the two kinds of 
pattern changes. Suppose we replace records r 
of A ∪ B by records r' of (𝒜 - A) ∪ (ℬ - B), 
respectively, where r' is similar to r. Then abrupt 
pattern changes of the first (resp. second) kind 
produced by the records r likely (resp. unlikely) 
are abrupt pattern changes produced by the 
records r'. 

There is a third interpretation. Suppose we 
extract from the sorted sequence of numerical 
values just the A and B labels. For example, the 
above sequence ..., 10.5(A), 11.7(A), 15.0(A), 
16.7(A), 19.5(B), 15.2(B), 24.1(B), 30.8(B),... 
becomes ..., A, A, A, A, B, B, B, B,.... We call 
this a label sequence. Then for an abrupt pattern 
change of the first (resp. second) kind, the random 
substitution of records r by records r' is unlikely 
(resp. likely) to change the label sequence. 
Cutpoint relies on the third interpretation in an 
attempt to distinguish between the two kinds 
of pattern changes, as follows. The method 
estimates the probability that A or B is selected 
in label sequences of abrupt pattern changes of 
the second kind, by assuming p = |A|/(|A| + |B|) 
(resp. q = |B|/(|A| + |B|)) to be the probability for 
the label A (resp. B) to occur. Then the standard 
deviation of the Gaussian convolution process is 
so selected that the following is assured. Suppose 
there is at least one abrupt pattern change that, 
according to the probabilities p and q, has low 
probability and thus is estimated to be of the first 
kind. Then the largest change of the smoothed 
class values and the associated marker tend to 
correspond to one such abrupt pattern change. 
Informally, one may say that the standard 
deviation σ is so selected that marker positions 
corresponding to abrupt pattern changes of the 
first kind are favored. 
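
To make the role of p and q concrete, the following sketch computes the probability of a given label sequence, assuming the labels occur independently with probabilities p and q. The independence assumption and the function names are ours, for illustration only. The sketch does not reproduce how Cutpoint derives the standard deviation from these probabilities.

```python
def label_probabilities(size_a, size_b):
    """Return (p, q) with p = |A|/(|A| + |B|) and q = |B|/(|A| + |B|)."""
    total = size_a + size_b
    return size_a / total, size_b / total


def sequence_probability(labels, p, q):
    """Probability of a label sequence if each label is drawn
    independently, "A" with probability p and "B" with probability q."""
    prob = 1.0
    for label in labels:
        prob *= p if label == "A" else q
    return prob


# Training sets of equal size, so p = q = 0.5 (illustrative sizes).
p, q = label_probabilities(4, 4)
prob = sequence_probability("AAAABBBB", p, q)
print(prob)  # 0.00390625
```

Under this model, a long run of one label followed by a long run of the other has low probability. By the reasoning in the text, such a pattern change is then estimated to be of the first kind.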

Cutpoint has been added to the version of the 
Lsquare method of Truemper (2004), which is 
based on prior versions of Felici and Truemper 
(2002) and Felici et al. (Felici, Sun, & Truemper, 
2004). Lsquare computes DNF (disjunctive 
normal form) logic formulas from logic training 
data. Cutpoint initially determines one marker for 
each attribute of the original data, as described 
previously. Let the transformation of A and B 
via these markers produce sets A' and B'. If A' 
and B' cannot be separated by logic formulas, 
then Cutpoint recursively determines additional 
markers. The Cutpoint/Lsquare combination 
is so designed that it does not require user 
specification of parameters or rules, except for 
a limit on the maximum number of markers for 
any attribute. In the tests described later, that 
maximum was fixed to 6. 

Computational Results 

To date, we have used Cutpoint in conjunction 
with Lsquare in a variety of projects such as 
credit rating, video image analysis, and word 
sense disambiguation. In each case, Cutpoint 
has proved to be effective and reliable. 

We also have compared the performance 
of Cutpoint with that of two entropy-based 
methods that differ by the subdivision selection 
and termination criterion. In one of the methods, 
the criterion is the clash condition of Cutpoint 
introduced later. We refer to this method as 
Entropy CC. For the other method, the criterion 
is the minimum description length (MDL) 
principle (Dougherty, Kohavi, & Sahami, 1995). 
Accordingly, we refer to that method as Entropy 
MDL. 

For the comparison, we applied Cutpoint and 
the two entropy-based methods to a number 
of data sets, and processed the resulting logic 
data by four classification algorithms. The latter 
schemes were so chosen that classification by 
decision trees, naive Bayes methods, support 
vector machines, and learning logic methods 
were represented. Note that we do not claim that 
each of the selected classification methods should 
use discretization as a preprocessing step. But if 
one decides to use such preprocessing, then the 
results indicate the following. Entropy MDL is 
preferred for decision tree methods and support 
vector machines, while Cutpoint is preferred for 
learning logic methods and naive Bayes methods. 
The performance of Entropy CC lies somewhere 
between that of Cutpoint and Entropy MDL, 
and thus is always dominated either by Cutpoint 
or Entropy MDL. These general conclusions 
are based on average performance results. 
For specific data sets, preference can be quite 
different. Thus, if one needs highest possible 
accuracy for a given situation, one should try 
all three schemes and select the one performing 
best. 

Optimized Classification 

These results apply only to the situation where 
costs of obtaining attribute data for records need 
not be considered. When such costs are present, 
Cutpoint, Entropy CC and Entropy MDL must be 
modified. For Cutpoint, the needed adjustments 
are discussed toward the end of the chapter. 

In the remainder of this section, we review prior 
work on discretization. For a comprehensive 
survey and a computational comparison of 
techniques, see Liu et al. (Liu, Hussain, Tan, & 
Dash, 2002). 


Entropy-Based Approaches 

The concept of entropy, as used in information 
theory, measures the purity of an arbitrary 
collection of examples (Mitchell, 1997). Suppose 
we have two classes of data, labeled N and P. 
Let n be the number of N instances, and define 
p to be the number of P instances. An estimate 
of the probability that class P occurs in the set 
is p/(p + n), while an estimate of the probability 
that class N occurs is n/(p + n). Entropy is then 
estimated as: 

$$\mathrm{entropy}(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} \tag{1}$$
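
As a quick check of equation (1), the estimate can be computed as below. The function name is an assumption made for the example, as is the usual convention that a term with zero count contributes zero.

```python
import math


def entropy(p, n):
    """Entropy estimate of equation (1) for p instances of class P
    and n instances of class N.

    A class with zero instances contributes nothing (0 * log 0 := 0).
    """
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:
            frac = count / total
            result -= frac * math.log2(frac)
    return result


print(entropy(5, 5))   # 1.0, an evenly mixed collection
print(entropy(10, 0))  # 0.0, a pure collection
```

The entropy is largest, namely 1, when both classes are equally represented, and drops to 0 when only one class occurs.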

Another value, called gain, indicates the value 
of separating the data records on a particular 
attribute. Let V be an attribute with two possible 
values. Define p_1 (resp. n_1) to be the number of 
P (resp. N) records that contain one of the two 
values. Similarly, let p_2 (resp. n_2) be the number 
of P (resp. N) records that contain the second 
value. Then: 

$$\mathrm{gain} = \mathrm{entropy}(p, n) - \left[\frac{p_1+n_1}{p+n}\,\mathrm{entropy}(p_1, n_1) + \frac{p_2+n_2}{p+n}\,\mathrm{entropy}(p_2, n_2)\right] \tag{2}$$
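
Equation (2) can be sketched in the same way. Here p_1 and n_1 count the P and N records with the first value of V, p_2 and n_2 count those with the second value, and p = p_1 + p_2, n = n_1 + n_2. The function names are again assumptions made for the example.

```python
import math


def entropy(p, n):
    """Entropy estimate of equation (1); zero counts contribute zero."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:
            frac = count / total
            result -= frac * math.log2(frac)
    return result


def gain(p1, n1, p2, n2):
    """Gain of equation (2) for an attribute with two possible values."""
    p, n = p1 + p2, n1 + n2
    total = p + n
    weighted = ((p1 + n1) / total) * entropy(p1, n1) \
        + ((p2 + n2) / total) * entropy(p2, n2)
    return entropy(p, n) - weighted


print(gain(4, 0, 0, 4))  # 1.0, the attribute separates P from N perfectly
print(gain(2, 2, 2, 2))  # 0.0, the attribute carries no information
```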

In generating decision trees, for example, the 
attribute with the highest gain value is used to 
split the tree at each level. 

The simplest approach to discretization is as 
follows. Assume that each record has a rational 
attribute, V. The records are first sorted according 
to V, yielding rational values v_1, v_2, ..., v_k. For each 
pair of values v_i and v_{i+1}, the average of the two is 
a potential marker to separate the P records from 
the N records. For each possible marker, the 
associated gain is computed. The highest gain 
indicates the best marker that separates the two 
classes of data (Quinlan, 1986). The method has 
been further developed to separate rational data 
into more than just two classes. In Fayyad and 
Irani (1992, 1993), a recursive heuristic for that 
task is described. The multiinterval technique 
first chooses a marker giving minimal entropy. It 
then recursively uses the minimum description 
length (MDL) principle to determine whether 
additional markers should be introduced. 
Another concept, called minimum splits, is 
introduced in Wang and Goh (1997). Minimum 
splits minimize the overall impurity of the 
separated intervals with respect to a predefined 
threshold. Although, theoretically, any impurity 
measurement could be used, entropy is 
commonly chosen. Since many minimum splits 
can be candidates, the optimal split is discovered 
by searching the minimum splits’ space. The 
candidate split with the smallest product of 
entropy and number of intervals is elected to be 
the optimal split. 
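
The simplest approach described at the start of this discussion, a single marker chosen by highest gain, can be sketched as follows. This is an illustration only, not the code of any of the cited methods. The name `best_marker`, the data, and the choice to skip pairs of equal values are assumptions made for the example.

```python
import math


def entropy(p, n):
    """Entropy estimate for p records of class P and n of class N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:
            frac = count / total
            result -= frac * math.log2(frac)
    return result


def best_marker(records):
    """Return (marker, gain) for the candidate marker of highest gain.

    `records` is a list of (value, label) pairs, label "P" or "N".
    Candidates are the averages of adjacent distinct sorted values.
    """
    records = sorted(records)
    size = len(records)
    total_p = sum(1 for _, label in records if label == "P")
    total_n = size - total_p
    base = entropy(total_p, total_n)
    best = None
    left_p = left_n = 0
    for i in range(size - 1):
        if records[i][1] == "P":
            left_p += 1
        else:
            left_n += 1
        if records[i][0] == records[i + 1][0]:
            continue  # equal values cannot be separated by a marker
        right_p, right_n = total_p - left_p, total_n - left_n
        left, right = left_p + left_n, right_p + right_n
        g = base - ((left / size) * entropy(left_p, left_n)
                    + (right / size) * entropy(right_p, right_n))
        marker = (records[i][0] + records[i + 1][0]) / 2.0
        if best is None or g > best[1]:
            best = (marker, g)
    return best


records = [(1.0, "P"), (2.0, "P"), (3.0, "P"), (5.0, "N"), (6.0, "N")]
marker, g = best_marker(records)
print(marker)  # 4.0
```

In this example the marker 4.0 separates the two classes completely, so its gain equals the entropy of the unsplit data.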

Entropy-based methods compete well with other 
data transformation techniques. In Dougherty et 
al. (1995), it is shown not only that discretization 
prior to execution of naive Bayes decision 
algorithms can significantly increase learning 
performance, but also that recursive minimal 
entropy partitioning performs best when 
compared with other discretization methods 
such as equal width interval binning and Holte’s 
1R algorithm (Holte, 1993). More comparisons 
involving entropy-based methods can be found in 
Kohavi and Sahami (1996), which demonstrates 
situations in which entropy-based methods using 
the MDL principle slightly outperform error- 
minimization methods. The error-minimization 
methods used in the comparison can be found 
in Maass (1994) and Auer et al. (Auer, Holte, 
& Maass, 1995). For information regarding 
the performance of entropy-based methods for 
learning classification rules, see An and Cercone 
(1999). 

Bottom-Up Methods 

Bottom-up methods initially partition the data 
set, then recombine similar adjacent partitions. 
The basic method is introduced in Srikant and 
Agrawal (1996). Major problems are low speed 
and bloating of the produced rule set. To offset 
long execution times, the number of intervals 
must be reduced. Uninteresting excess rules 
may be pruned using an interest measure. Data 
clustering has been used (Miller & Yang, 1997) 
to generate more meaningful rules. Yet another 
approach to merging related intervals is used in 
the contrast set miner (Bay & Pazzani, 1999). 
The use of one such machine, called STUCCO, 
is illustrated in Bay (2000). 

Other Approaches 

Bayes’ Law has also been utilized for 
discretization. Wu (1996) demonstrates one such 
method. In it, curves are constructed based upon 
the Bayesian probability of a particular attribute’s 
value in the data set. Markers are placed where 
leading curves differ on two sides. 

A number of investigations have focused on 
simultaneous analysis of attributes during the 
transformation process. Dougherty et al. (1995) 
coin the term dynamic to refer to methods that 
conduct a search through the space of possible 
k values for all features simultaneously. For an 
example method, see Gama et al. (Gama, Torgo, 
& Soares, 1998). 

Relatedly, publications tend to use the term 
multivariate with different interpretations. 
Kwedlo and Krętowski (1999) refer to a 
multivariate analysis as one that simultaneously 
searches for threshold values for continuous- 
valued attributes. They use such an analysis with 
an evolutionary algorithm geared for decision 
rule induction. Bay (2000), however, declares 
that a multivariate test of differences takes as 
input instances drawn from two probability 
distributions and determines if the distributions 
are equivalent. This analysis maintains the 
integrity of any hidden patterns in the data. 
Boros et al. (Boros, Hammer, Ibaraki, & Kogan, 
1997) explore several optimization approaches 
for the selection of breakpoints. In each case, all 
attributes of the records of the training sets A and 
B are considered simultaneously. For example, 
minimization of the total number of breakpoints 
is considered. The reference provides polynomial 
solution algorithms for some of the optimization 
problems and establishes other problems to be 
NP-hard. Boros et al. (Boros, Hammer, Ibaraki, 
Kogan, Mayoraz, & Muchnik, 2000) describe a 
discretization method that is integrated into the 
so-called logic analysis of data (LAD) method. 
In that setting, the discretization requires solution 
of a potentially large set-covering problem. 
A heuristic method is employed to solve that 
problem approximately. 

DEFINITIONS 

We need a few definitions for the discussion of 
Cutpoint. 

Unknown Values 

At times, the records of A and B may be 
incomplete. Following Truemper (2004), we 
consider two values that indicate entries to be 
unknown. They are Absent and Unavailable. The 
value Absent means that the value is unknown but 
could be obtained, while Unavailable means that 
the value cannot be obtained. Of course, there 
are in-between cases. For example, a diagnostic 
value could be obtained in principle but is 
not determined, since the required test would 
endanger the life of the patient. Here, we force 
such in-between cases to be classified as Absent 
or Unavailable. For the cited diagnostic case, 
the choice Unavailable would be appropriate. 
Another way to view Absent and Unavailable 
is as follows. Absent means that the value is 
unknown, and that this fact is, in some sense, 
independent from the case represented by the 
given record. On the other hand, Unavailable 
tells that the reason why the value is not known 
is directly connected with the case of the record. 
Thus, Unavailable implicitly is information 
about the case of the record, while Absent is 
not. This way of differentiating between Absent 
and Unavailable implies how irrelevant values 
are handled. That is, if a value is declared to 
be irrelevant or inapplicable, then this fact is 
directly connected with the case of the record, 
and thus is encoded by the value Unavailable. 

In prior work, the treatment of unknown 
values typically does not depend on whether 
the unknown value could be obtained. For 
example, the average value of the attribute is 
often used for missing values (Mitchell, 1997). 
As another example, database methods such as 
SQL use NULL to represent unknown entries 
(Ramakrishnan & Gehrke, 2003). In applications, 
we have found the distinction between Absent 
and Unavailable to be useful. For example, a 
physician may declare that it is unnecessary that 
a certain diagnostic value be obtained. In that 
case, we call the value irrelevant and encode it 
by assigning the value Unavailable. Conversely, 
if a diagnostic value is deemed potentially 
useful but is not yet attained, we assign the value 
Absent. 

It is convenient that we expand the definition 
of logic data and rational data so that Absent 
and Unavailable are allowed. Thus, logic data 
have each entry equal to True, False, Absent, 
or Unavailable, while rational data have each 
entry equal to a rational number, Absent, or 
Unavailable. 

Records 

A record contains any mixture of logic data 
and rational data. There are two sets A and B 
of records. Each record of the sets has the same 
number of entries. For each fixed j, the jth entries
of all records are of the same data type. We 
want to transform records of A and B to records 
containing just logic data, with the objective that 
logic formulas, determined by any appropriate 
method, can classify the records correctly as 
coming from A or B.

Populations

Typically, the sets A and B come from
populations $\mathcal{A}$ and $\mathcal{B}$, respectively, and we want
the transformations and logic formulas derived
from A and B to classify the remaining records
of $\mathcal{A} - A$ and $\mathcal{B} - B$ with high accuracy.

DNF Formulas 

A literal is the occurrence of a possibly negated
variable in a logic formula. A disjunctive normal
form (DNF) formula is a disjunction of conjunctions
of literals. For example,
$(x_1 \wedge \neg x_2) \vee (x_2 \wedge x_3) \vee (x_1 \wedge \neg x_3)$
is a DNF formula. The evaluation of
DNF formulas requires the following adjustments
when the values Absent and Unavailable occur.
Let D be the DNF formula $D = D_1 \vee D_2 \vee \cdots \vee D_m$,
where the $D_j$ are the DNF clauses. For example,
we may have $D_j = x \wedge y \wedge \neg z$, where $x$, $y$, and
$\neg z$ are the literals of logic variables x, y, and z.

The DNF clause $D_j$ evaluates to True if the
variable of each literal has been assigned a True/
False value so that the literal evaluates to True.
For example, $D_j = x \wedge y \wedge \neg z$ evaluates to True if
x = y = True and z = False. The clause $D_j$ evaluates
to False if, for at least one variable occurring in
$D_j$, the variable has a True/False value so that the
corresponding literal evaluates to False, or if the
variable has the value Unavailable. For example,
x = False or x = Unavailable cause
$D_j = x \wedge y \wedge \neg z$ to evaluate to False. If neither
the True case nor the False case applies, then $D_j$
has the value Undecided.
Thus, the Undecided case occurs if the following
three conditions hold: (1) each variable of $D_j$ has
True, False, or Absent as value; (2) there is at
least one Absent case; and (3) all literals for the
True/False cases evaluate to True. For example,
$D_j = x \wedge y \wedge \neg z$ evaluates to Undecided if x =
Absent, y = True, and z = False.

The DNF formula $D = D_1 \vee D_2 \vee \cdots \vee D_m$ evaluates
to True if at least one $D_j$ has value True, to False
if all $D_j$ have value False, and to Undecided
otherwise. Thus in the Undecided case, each $D_j$
has value False or Undecided, and there is at
least one Undecided case.
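
The evaluation rules above can be sketched in a few lines of Python. This is only an illustration under assumed conventions, not code from the chapter: the enum `V`, the functions `eval_clause` and `eval_dnf`, and the representation of a clause as a list of `(variable, negated)` pairs are all hypothetical names chosen for the example.

```python
from enum import Enum


class V(Enum):
    """Possible entry values and evaluation outcomes (names are illustrative)."""
    TRUE = "True"
    FALSE = "False"
    ABSENT = "Absent"
    UNAVAILABLE = "Unavailable"
    UNDECIDED = "Undecided"


def eval_clause(clause, record):
    """Evaluate one DNF clause.

    clause: list of (variable_name, negated) pairs, one per literal.
    record: dict mapping variable_name to V.TRUE/FALSE/ABSENT/UNAVAILABLE.
    """
    saw_absent = False
    for var, negated in clause:
        value = record[var]
        if value is V.UNAVAILABLE:
            return V.FALSE          # Unavailable forces the clause to False
        if value is V.ABSENT:
            saw_absent = True       # may still become False via another literal
            continue
        literal_is_true = (value is V.TRUE) != negated
        if not literal_is_true:
            return V.FALSE          # a True/False value falsifies the literal
    return V.UNDECIDED if saw_absent else V.TRUE


def eval_dnf(clauses, record):
    """Evaluate a DNF formula given as a list of clauses."""
    results = [eval_clause(c, record) for c in clauses]
    if any(r is V.TRUE for r in results):
        return V.TRUE
    if all(r is V.FALSE for r in results):
        return V.FALSE
    return V.UNDECIDED
```

With the clause $x \wedge y \wedge \neg z$ encoded as `[("x", False), ("y", False), ("z", True)]`, the three examples of the text yield True, False, and Undecided, respectively.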


As an aside, prior rules on the treatment of 
unknown values effectively treat them as Absent. 
For example, the stated evaluation of DNF 
formulas for Absent values is consistent with the 
evaluation of logic formulas of SQL for NULL 
values (Ramakrishnan & Gehrke, 2003). 

Clash Condition 

Suppose we desire classification by DNF 
formulas. Specifically, we want two DNF 
formulas of which one evaluates to True on 
the records derived from A and to False on the 
records derived from B, while the second formula 
achieves the opposite True/False values. We call 
these formulas separating. Note that the outcome 
Undecided is not allowed. That value may occur, 
however, when a DNF formula evaluates records 
of $(\mathcal{A} - A) \cup (\mathcal{B} - B)$. Effectively, a formula then
votes for membership in A or B, or declares the 
case to be open. We associate with the vote for 
A and B a numerical value of 1 or -1, resp., and 
assign to the Undecided case the value 0. This 
rule is useful when sets of formulas are applied, 
since then the vote total expresses the strength of 
belief that a record is in A or B. 
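
As a small illustration of the voting rule, the following sketch maps formula outcomes to votes and sums them. The function names and the flag `true_means_a` (which tells whether the formula is the one that evaluates to True on A records) are assumptions made for this example only.

```python
def vote(result, true_means_a=True):
    """Map a formula outcome ("True", "False", "Undecided") to a vote.

    Returns 1 for a vote for A, -1 for a vote for B, and 0 for Undecided.
    true_means_a is True for a formula meant to be True on A records and
    False for a formula meant to be True on B records.
    """
    if result == "Undecided":
        return 0
    votes_for_a = (result == "True") == true_means_a
    return 1 if votes_for_a else -1


def vote_total(outcomes):
    """Sum the votes of several formulas; outcomes is a list of
    (result, true_means_a) pairs."""
    return sum(vote(result, flag) for result, flag in outcomes)
```

A large positive total suggests membership in A, a large negative total membership in B, and a total near 0 an open case.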

There is a simple necessary and sufficient 
condition for the existence of the separating 
formulas. We call it the clash condition. For 
the description of the condition, we assume for 
the moment that the records of A and B contain 
just logic data. We say that an A record and a 
B record clash if the A record has a True/False 
entry for which the corresponding entry of the 
B record has the opposite True/False value or 
Unavailable, and if the B record has a True/ 
False entry for which the corresponding entry of 
the A record has the opposite True/False value 
or Unavailable. 

For example, let each record of $A \cup B$ have three
entries $x_1$, $x_2$, and $x_3$, and suppose that an A record
is ($x_1$ = True, $x_2$ = Unavailable, $x_3$ = False) and
that a B record is ($x_1$ = False, $x_2$ = True, $x_3$ =
False). Then the entry $x_1$ = True of the A record
differs from $x_1$ = False of the B record, and thus
the two records clash. On the other hand, take
the same A record, but let the B record be ($x_1$ =
True, $x_2$ = Unavailable, $x_3$ = Unavailable). Then
there is no True/False value in the B record for
which the A record has the opposite True/False
value or Unavailable, and thus the two records
do not clash.

Define the clash condition to be satisfied by sets 
A and B containing only logic data if every record 
of A clashes with every record of B. The following 
theorem links the existence of separating DNF 
formulas and the clash condition. We omit the 
straightforward proof. 

Theorem 1. Let sets A and B contain just logic 
data. Then two separating DNF formulas exist if 
and only if the clash condition is satisfied. 
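
The clash test and the clash condition translate directly into code. The sketch below assumes that a record is a tuple of the strings "True", "False", "Absent", and "Unavailable"; the function names are hypothetical.

```python
TRUE_FALSE = {"True", "False"}


def _has_clashing_entry(r, s):
    """True if r has a True/False entry whose counterpart in s is the
    opposite True/False value or Unavailable."""
    for a, b in zip(r, s):
        if a in TRUE_FALSE:
            if b == "Unavailable" or (b in TRUE_FALSE and b != a):
                return True
    return False


def clash(a_record, b_record):
    """Both directions of the definition must hold."""
    return (_has_clashing_entry(a_record, b_record)
            and _has_clashing_entry(b_record, a_record))


def clash_condition(A, B):
    """Every record of A must clash with every record of B."""
    return all(clash(a, b) for a in A for b in B)
```

For the A record (True, Unavailable, False) of the example, `clash` returns True against (False, True, False) and False against (True, Unavailable, Unavailable).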

Let sets A and B of records be given. Define J to 
be the set of indices j for which the jth entries of 
the given records contain rational data. 

Cutpoint recursively defines markers for the
jth entries of the $j \in J$, where in each pass one
marker is defined. It is convenient that we divide
the description of the marker selection into
two parts, which make up the subsequent two
sections. The next section covers the selection of
the initial marker for an arbitrary $j \in J$, and the
following section deals with the case where an
additional marker is to be found.


INITIAL MARKER 

Let $j \in J$. We denote the rational numbers in jth
position, sorted in increasing order, by
$z_1 \le z_2 \le \cdots \le z_N$. For the moment, we ignore all Absent
and Unavailable values that may occur in the jth
position.

Class Values 

We associate with each $z_i$ a class value $v_i$ that
depends on whether $z_i$ is equal to any other
$z_h$, and whether $z_i$ is in a record of set A or B.
Specifically, if $z_i$ is unique and thus not equal to
any other $z_h$, then $v_i$ is 1 (resp. 0) if the record
with $z_i$ as jth entry is in A (resp. B). If $z_i$ is not
unique, let H be the set of indices h for which
$z_h = z_i$. Note that $i \in H$. Let $H_A$ (resp. $H_B$) be the
subset of the $h \in H$ for which $z_h$ is the jth entry
of a record in set A (resp. B). If $h \in H_A$ (resp.
$h \in H_B$), we say that $z_h$ produces a local class
value equal to 1 (resp. 0). The class value $v_i$ is
then the average of the local class values for the
$z_h$ with $h \in H$. Thus, $v_i = [1 \cdot |H_A| + 0 \cdot |H_B|] / |H|$
or, compactly,

$$v_i = |H_A| / |H| \tag{3}$$

The formula also covers the case of unique $z_i$,
since then $H = \{i\}$ and either $H_A = \{i\}$ or $H_A =
\emptyset$, depending on whether the record with $z_i$ as jth
entry is in A or B, respectively.

For example, suppose $z_1 = 2$, $z_2 = 5$, and $z_5 =
10$ occur in records of set A, and $z_3 = 7$ and $z_4 =
10$ occur in records of set B. Since $z_1$ and $z_2$ are
unique and occur in records of set A, we have $v_1 =
v_2 = 1$. Similarly, uniqueness of $z_3$ and occurrence
in a B record produce $v_3 = 0$. The values $z_4$ and
$z_5$ are equal and exactly one of them, $z_5$, occurs
in a record of set A. Thus for both $z_4$ and $z_5$, we
have $H = \{4, 5\}$ and $H_A = \{5\}$, and by (3), $v_4 = v_5
= |H_A| / |H| = 0.5$.
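
Formula (3) amounts to counting, for each distinct value, how many of its occurrences come from A records. A minimal sketch follows; the function name `class_values` and the use of the labels "A" and "B" are assumptions for the example.

```python
from collections import defaultdict
from fractions import Fraction


def class_values(z, labels):
    """Compute the class values v_i of formula (3).

    z:      list of rational values of one attribute.
    labels: list of "A"/"B", the set each value's record belongs to.
    Returns a list of Fractions, one per entry of z.
    """
    counts = defaultdict(lambda: [0, 0])   # value -> [|H_A|, |H|]
    for value, label in zip(z, labels):
        counts[value][1] += 1
        if label == "A":
            counts[value][0] += 1
    return [Fraction(counts[value][0], counts[value][1]) for value in z]
```

For the example above, `class_values([2, 5, 7, 10, 10], ["A", "A", "B", "B", "A"])` gives 1, 1, 0, 1/2, 1/2.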

Recall that a marker corresponds to an abrupt
change of classification pattern. In terms of
class values, a marker is a value c where many
if not all $z_i$ close to c and satisfying $z_i < c$ have
high class values, while most if not all $z_i$ close
to c and satisfying $z_i > c$ have low class values,
or vice versa. We identify markers following
a smoothing of the class values by Gaussian
convolution, a much used tool. For example, it
is employed in computer vision for the detection
of edges in digitized images; see Forsyth and
Ponce (2003).

Smoothed Class Values 

Gaussian convolution uses the normal distribution
with mean equal to 0 for smoothing of data. For
completeness, we include the relevant formulas.
For mean 0 and standard deviation $\sigma > 0$, the
probability density function of the normal
distribution is:

$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-y^2/(2\sigma^2)}, \qquad -\infty < y < \infty \tag{4}$$

In our case, we always choose $\sigma$ to be a positive
integer. We cover the selection in a moment.

For any integer g and the selected $\sigma$, let $\beta_g$ denote
the probability that the random variable defined
by f(y) falls into the open interval (g - 0.5, g + 0.5).
Since g is the midpoint of the open unit interval
(g - 0.5, g + 0.5), we have:

$$\beta_g = \int_{g-0.5}^{g+0.5} f(y)\, dy \approx f(g) \tag{5}$$

The smoothing process uses the $\beta_g$ values to
derive, from the class values $v_i$, smoothed values
$\tilde{v}_i$ by the formula:

$$\tilde{v}_i = \sum_{g=-\infty}^{\infty} \beta_g \cdot v_{i+g} \tag{6}$$

The formula relies on the convention that each
$v_{i+g}$ without defined value, that is, with $i + g < 1$
or $i + g > N$, is declared to be 0. For the values
of $\sigma$ of interest and for $|g| \ge 2\sigma + 1$, the $\beta_g$ are
sufficiently small that they can be ignored. That
fact and the relation $\beta_g = \beta_{-g}$ for all g allow us to
simplify (6) for each actual computation to:

$$\tilde{v}_i = \beta_0 \cdot v_i + \sum_{g=1}^{2\sigma} \beta_g \cdot (v_{i+g} + v_{i-g}) \tag{7}$$

The assumption of $v_i = 0$ outside the known
values $v_1, v_2, \ldots, v_N$ results in biased or, rather,
unusable values $\tilde{v}_i$ for $1 \le i \le 2\sigma$ and $N - 2\sigma + 1 \le
i \le N$. As a consequence, we ignore these values
and declare the remaining $\tilde{v}_i$ values usable.
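
The smoothing step can be sketched as follows. The code is an illustration, not the authors' implementation: the function names `beta` and `smooth` are hypothetical, and the weights are computed as the exact interval probabilities via `math.erf` rather than by the density value at the midpoint. Only the usable indices are returned, so the zero-padding convention never comes into play.

```python
import math


def beta(g, sigma):
    """Probability that a normal variable with mean 0 and standard
    deviation sigma falls into the interval (g - 0.5, g + 0.5)."""
    def cdf(y):
        return 0.5 * (1.0 + math.erf(y / (sigma * math.sqrt(2.0))))
    return cdf(g + 0.5) - cdf(g - 0.5)


def smooth(v, sigma):
    """Smoothed class values by the truncated sum with |g| <= 2*sigma.

    v:     list of class values v_1..v_N (stored 0-based).
    sigma: positive integer standard deviation.
    Returns a dict mapping each usable 1-based index i, with
    2*sigma + 1 <= i <= N - 2*sigma, to its smoothed value.
    """
    n = len(v)
    weights = [beta(g, sigma) for g in range(2 * sigma + 1)]
    smoothed = {}
    for i in range(2 * sigma + 1, n - 2 * sigma + 1):
        k = i - 1                                  # 0-based position of v_i
        total = weights[0] * v[k]
        for g in range(1, 2 * sigma + 1):
            total += weights[g] * (v[k + g] + v[k - g])
        smoothed[i] = total
    return smoothed
```

Because the weights beyond $|g| = 2\sigma$ are dropped, a constant sequence of class values 1 is smoothed to slightly less than 1 (about 0.988 for $\sigma = 1$).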

Selection of Standard Deviation 

We select the standard deviation $\sigma$ via an analysis
of classification patterns. Suppose we produce
sequences made up of the letters A and B. We
construct a given sequence by randomly selecting
one letter at a time, choosing the letter A with
probability p and the letter B with probability q
= 1 - p. In the construction of a sequence, we
begin with the sequence AB. For given $k \ge 1$ and
$l \ge 1$, we adjoin k - 1 As in front of AB and l
- 1 Bs behind AB. At this point, we have k As
followed by l Bs. Finally, we add a B in front
and an A at the end. What is the probability that
such a sequence S is constructed from AB when
we randomly select letters and add them first
in front and then at the end, until a sequence of
the described form is achieved? Since the initial
sequence AB is given, the probability is:

$$P[S] = p^k q^l \tag{8}$$

For $m \ge 1$, consider the event $E_m$ where the
previously mentioned process constructs any S
for which $k \ge m$ or $l \ge m$. We add up the appropriate
probabilities of (8) to get the probability $\alpha_m$ that
$E_m$ occurs. Using the fact that the sum of the
probabilities of all possible cases is 1, that is,

$$\sum_{k \ge 1,\, l \ge 1} p^k q^l = 1 \tag{9}$$

we compute $\alpha_m$ as shown in equation (10).

$$\begin{aligned}
\alpha_m &= \sum_{k \ge m,\, l \ge 1} p^k q^l + \sum_{k \ge 1,\, l \ge m} p^k q^l - \sum_{k \ge m,\, l \ge m} p^k q^l \\
&= p^{m-1} \sum_{k \ge 1,\, l \ge 1} p^k q^l + q^{m-1} \sum_{k \ge 1,\, l \ge 1} p^k q^l - (pq)^{m-1} \sum_{k \ge 1,\, l \ge 1} p^k q^l \\
&= p^{m-1} + q^{m-1} - (pq)^{m-1}
\end{aligned} \tag{10}$$

Define the length of S to be the number of As and
Bs minus 2, which is k + l. Effectively, we do not
count the initial B of S and the final A of S. The
expected length L of S is:

$$\begin{aligned}
L &= \sum_{k \ge 1,\, l \ge 1} (k + l)\, p^k q^l \\
&= \sum_{k \ge 1} k p^k \sum_{l \ge 1} q^l + \sum_{k \ge 1} p^k \sum_{l \ge 1} l q^l \\
&= \frac{p}{(1-p)^2} \cdot \frac{q}{1-q} + \frac{p}{1-p} \cdot \frac{q}{(1-q)^2} \\
&= \frac{1}{pq}
\end{aligned} \tag{11}$$

Suppose we have a sequence T of N randomly
selected As and Bs. What is the expected number
of the above sequences S occurring in T? For our
purposes, a sufficiently precise estimate is:

$$N / L = N p q \tag{12}$$

Of the expected number of sequences S occurring
in T, the fraction of sequences that qualify for
being sequences of event $E_m$ is approximately
equal to $\alpha_m$. Thus, a reasonable estimate of the
expected number of sequences of $E_m$ occurring
in T, which we denote by K(N, m), is:

$$K(N, m) = (N / L)\, \alpha_m = N p q\, [p^{m-1} + q^{m-1} - (pq)^{m-1}] \tag{13}$$

Each S occurring in T is a potential case for a
marker that corresponds to the point where k
As transition to l Bs. We do not want markers
to result from sequences S that likely have been
produced by randomness. We try to avoid such
choices as follows.

Suppose that, for some $m \ge 1$, K(N, m) is
approximately equal to 1. This implies that, on
average, there is one sequence S of $E_m$. Since $\alpha_m$
of (10) decreases geometrically as m increases,
such a sequence S of $E_m$ typically has not much
more than m As or Bs, and any sequence S' with
larger number of As or Bs is very unlikely to
occur. We use this fact as follows. We select a
value $m^* \ge 1$ so that $K(N, m^*)$ is as close to 1 as
possible. Then we choose the standard deviation
$\sigma$ so that the sequence S with about $m^*$ As or Bs
that we can expect to occur does not produce
a marker if there is a sequence S' with length
significantly larger than $m^*$.

By these arguments, the latter sequence S' is 
unlikely to have been produced by randomness, 
and thus is likely due to a particular behavior of 
the values of the attribute under consideration. 

In terms of the discussion in the introduction, we 
estimate that we have an abrupt pattern change 
of the first kind. 

We achieve the desired effect by selecting $\sigma
= m^*$. Indeed, that choice produces significant
probabilities $\beta_g$ for g, $m^* \le g \le 2m^*$, and these
probabilities tend to smooth out the classification
values $v_i$ associated with the As and Bs of all
randomly produced sequences S.

When N is not large, certain boundary effects
should be addressed. We describe the adjustment
and then justify it. Instead of demanding that
$K(N, m^*)$ is close to 1, we ignore the first and last
$m^*$ As and Bs of the sequence T, and require that
$K(N - 2m^*, m^*)$ defined from:

$$K(N - 2m, m) = (N - 2m)\, p q\, [p^{m-1} + q^{m-1} - (pq)^{m-1}] \tag{14}$$

is as close to 1 as possible. We motivate
the adjustment as follows. When Gaussian
convolution is performed with $\sigma = m^*$, the first
smoothed class value is computed using the
values $v_i$ of the $4\sigma + 1$ As and Bs at the beginning
of T. Denote that subsequence by T'. If the central
$2\sigma + 1$ As and Bs of T' contain an S of some $E_m$
with $m \le m^*$, then the class values of the As and
Bs of any such S tend to be smoothed out. Thus, S
is unlikely to result in a marker if a subsequence
S' with length greater than $m^*$ exists.

We establish the probability p for (14) and
compute $m^*$ as follows. We take p to be the
fraction of the number of training records of
class A divided by the total number of training
records, and we find $m^*$ by dichotomous search.
We note that, due to the symmetry of the formula
K(N - 2m, m), the choice of $m^*$ implicitly also
considers subsequences in which the roles of A
and B are reversed. Table 1 shows $\sigma$ as a function
of N, for $\sigma \le 10$ and p = q = 0.5.
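
To make the selection of $m^*$ concrete, the sketch below evaluates formula (14) and picks the m whose value is closest to 1. It is an illustration under stated assumptions: the names `expected_count` and `select_m_star` are hypothetical, a plain linear scan over all m with N - 2m > 0 replaces the dichotomous search mentioned in the text, and ties are broken in favor of the smaller m, which the text does not specify.

```python
def expected_count(n, m, p):
    """K(N - 2m, m) of formula (14) for sequence length n and
    probability p of the letter A."""
    q = 1.0 - p
    alpha = p ** (m - 1) + q ** (m - 1) - (p * q) ** (m - 1)
    return (n - 2 * m) * p * q * alpha


def select_m_star(n, p):
    """Return the m >= 1 for which K(N - 2m, m) is closest to 1.

    Only m with n - 2m > 0 are considered; m = 1 is always allowed.
    """
    upper = max(1, (n - 1) // 2)
    candidates = range(1, upper + 1)
    return min(candidates,
               key=lambda m: (abs(expected_count(n, m, p) - 1.0), m))
```

For p = q = 0.5, this scan gives, for instance, $m^* = 1$ for N = 7, $m^* = 2$ for N = 8, $m^* = 5$ for N = 32, and $m^* = 6$ for N = 60.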

There is an exceptional case where the selected
$\sigma$ must be reduced. As we argue shortly (see the
discussion following (16)), we do not consider
a marker between $z_{i-1}$ and $z_i$ if $v_{i-1} = v_i$. Thus, no
marker can be placed if no i satisfies $2\sigma + 2 \le i
\le N - 2\sigma$ and $v_i \ne v_{i-1}$. If that case occurs, several
corrective actions are possible. We have found
that reduction of $\sigma$ to 1 is a good choice. If for the
reduced $\sigma$ there still is no index i satisfying $2\sigma +
2 \le i \le N - 2\sigma$ and $v_i \ne v_{i-1}$, then we declare that
no intervals should be created for the jth entry;
as a consequence, we delete the jth entry from
all records of A and B. Otherwise, we proceed
with the reduced $\sigma = 1$.

For example, if N = 37, $\sigma = 6$, and $v_{13} = 1$, $v_{14} =
v_{15} = \cdots = v_{30} = 0$, $v_{31} = 1$, then no i satisfies $2\sigma
+ 2 = 14 \le i \le N - 2\sigma = 25$ and $v_i \ne v_{i-1}$. Thus,
$\sigma$ should be reduced to 1. For that value, both i
= 14 and i = 31, and possibly other values of i,
satisfy $2\sigma + 2 = 4 \le i \le N - 2\sigma = 35$ and $v_i \ne v_{i-1}$.
Thus, $\sigma = 1$ should be used. On the other hand,
let N = 37 and $\sigma = 6$ as before, but suppose $v_1 =
1$, $v_2 = v_3 = \cdots = v_{35} = 0$, $v_{36} = 1$, $v_{37} = 0$. For $\sigma =
6$, no i satisfies $2\sigma + 2 = 14 \le i \le N - 2\sigma = 25$
and $v_i \ne v_{i-1}$.
Reduction of $\sigma$ to 1 produces the same negative
conclusion. Thus, no intervals should be created
for the jth entries, and we delete these entries
from all records of A and B.

Table 1. $\sigma$ as function of N for $\sigma \le 10$ and p =
q = 0.5

N           $\sigma$
3-7         1
8-11        2
12-15       3
16-31       4
32-54       5
55-99       6
100-186     7
187-359     8
360-702     9
703-1387    10


Definition of Marker 

Suppose we have selected $\sigma$, as described
previously, and have computed the smoothed
class values $\tilde{v}_i$. As we move along the sequence
of usable values $\tilde{v}_i$, the absolute difference $\delta_i$
between adjacent $\tilde{v}_{i-1}$ and $\tilde{v}_i$,

$$\delta_i = |\tilde{v}_i - \tilde{v}_{i-1}| \tag{15}$$

measures the abruptness with which class values
change. We call $\delta_i$ a difference value. The largest
such value, say $\delta_{i^*}$, produces a marker c between
$z_{i^*-1}$ and $z_{i^*}$. That is,

$$c = (z_{i^*-1} + z_{i^*}) / 2 \tag{16}$$

The selection rule for c requires a small
adjustment due to a quirk that may be introduced
by the convolution process. It is possible that,
for the selected c, the corresponding original
class values $v_{i^*-1}$ and $v_{i^*}$ are equal. In case all $z_i$
are distinct, both values $z_{i^*-1}$ and $z_{i^*}$ separated
by c come either from A records or from B
records. If several $z_i$ are equal, more complex
interpretations are possible. However, all of
them reflect unattractive cases.

To rule out all such situations, we restrict
the selection of the difference value $\delta_{i^*}$ by
considering $\delta_i$ values only if $v_i \ne v_{i-1}$. Thus,

$$\delta_{i^*} = \max_i \{\delta_i \mid \tilde{v}_i, \tilde{v}_{i-1} \in U,\ v_i \ne v_{i-1}\} \tag{17}$$

where U is the set of usable values. If the
maximum is attained by several $i^*$, we pick one
closest to N/2, breaking any secondary tie by a
random choice.

For example, if $\sigma = 6$ and N = 60, then the $\tilde{v}_i$
with index i satisfying $2\sigma + 1 = 13 \le i \le N - 2\sigma
= 48$ are usable. Suppose these values are $\tilde{v}_{13} =
0.3214$, $\tilde{v}_{14} = 0.3594$, $\tilde{v}_{15} = 0.4042$, $\tilde{v}_{16} = 0.4439$,
$\tilde{v}_{17} = 0.4760$, $\tilde{v}_{18} = 0.4986$, ..., $\tilde{v}_{45} = 0.4740$, $\tilde{v}_{46} =
0.4410$, $\tilde{v}_{47} = 0.4007$, and $\tilde{v}_{48} = 0.3612$. For these
values, formula (15) produces $\delta_{14} = 0.0380$,
$\delta_{15} = 0.0448$, $\delta_{16} = 0.0397$, $\delta_{17} = 0.0321$, $\delta_{18} =
0.0226$, ..., $\delta_{46} = 0.0330$, $\delta_{47} = 0.0403$, and $\delta_{48} =
0.0395$. Suppose the largest $\delta_i$ for which $v_i \ne v_{i-1}$
is unique and is $\delta_{15} = 0.0448$. Thus, $i^* = 15$. If $z_{i^*} =
z_{15} = 7$ and $z_{i^*-1} = z_{14} = 5$, the marker c is defined
by $c = (z_{i^*-1} + z_{i^*}) / 2 = (5 + 7) / 2 = 6$.
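
The selection rule (15)-(17), including the tie-breaking, can be sketched as follows. The function name `select_marker` and its argument layout are assumptions for the example; in particular, the smoothed values are passed as a dict keyed by the 1-based usable indices.

```python
import random


def select_marker(z, v, v_smooth, sigma, rng=random):
    """Pick the marker c by rules (15)-(17).

    z, v:     lists holding z_1..z_N and v_1..v_N (stored 0-based).
    v_smooth: dict mapping each usable 1-based index to its smoothed value.
    Returns (c, i_star, delta) or None if no index qualifies.
    """
    n = len(z)
    best, best_delta = [], None
    for i in range(2 * sigma + 2, n - 2 * sigma + 1):
        if v[i - 1] == v[i - 2]:               # skip if v_i == v_{i-1}
            continue
        delta = abs(v_smooth[i] - v_smooth[i - 1])
        if best_delta is None or delta > best_delta:
            best, best_delta = [i], delta
        elif delta == best_delta:
            best.append(i)
    if not best:
        return None
    # First tie-break: closest to N/2; second tie-break: random choice.
    closest = min(abs(i - n / 2) for i in best)
    ties = [i for i in best if abs(i - n / 2) == closest]
    i_star = rng.choice(ties)
    c = (z[i_star - 2] + z[i_star - 1]) / 2    # (z_{i*-1} + z_{i*}) / 2
    return c, i_star, best_delta
```

A return value of None corresponds to the case where no index i in the admissible range has $v_i \ne v_{i-1}$.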

The next scheme summarizes the computation 
producing the initial marker c. The scheme 
also outputs the standard deviation a of the 
convolution process since that information 
is needed later in another application of the 
algorithm. 


ALGORITHM INITIAL MARKER 

Input: Rational numbers $z_1 \le z_2 \le \cdots \le z_N$ of the
jth attribute of the records of A and B.

Output: Either: Marker c for the jth attribute,
standard deviation $\sigma$ of the convolution process,
and the difference value $\delta_{i^*}$ associated with the
marker. Or: “Marker cannot be determined.”

Procedure:

1. (Check if N is too small or if $\sigma = 1$
cannot produce a marker.) If N < 6 or if,
for $\sigma = 1$, there is no index i satisfying
$2\sigma + 2 \le i \le N - 2\sigma$ and $v_i \ne v_{i-1}$, then
output “Marker cannot be determined,”
and stop. (In that case, one should delete
the jth entries from all records of A and
B.)

2. (Compute class values.) For i = 1, 2, ...,
N, define $H^i = \{h \mid z_h = z_i\}$, $H^i_A = \{h \in
H^i \mid z_h$ is taken from an A record$\}$, and
compute the class value $v_i = |H^i_A| / |H^i|$.

3. (Define p, q, and $\sigma$.) Define p = |A|/(|A|
+ |B|) and q = 1 - p. Let $m^*$ be the value
$m \ge 1$ for which K(N - 2m, m) is closest
to 1. Let $\sigma = m^*$. If there is no index i
satisfying $2\sigma + 2 \le i \le N - 2\sigma$ and $v_i \ne
v_{i-1}$, lower $\sigma$ to 1.

4. (Compute smoothed class values.) For i
= 1, 2, ..., N, use the class values $v_i$, the
standard deviation $\sigma$, and the $\beta_g$ values
of (5) to compute the smoothed class
values $\tilde{v}_i = \beta_0 \cdot v_i + \sum_{g=1}^{2\sigma} \beta_g \cdot (v_{i+g} + v_{i-g})$.

5. (Select marker.) For $i = 2\sigma + 2, 2\sigma + 3, \ldots,
N - 2\sigma$, let $\delta_i = |\tilde{v}_i - \tilde{v}_{i-1}|$. Select $i^*$ so that
$\delta_{i^*} = \max_i \{\delta_i \mid v_i \ne v_{i-1}\}$. If $i^*$ is not unique,
select an $i^*$ closest to N/2 and break any
secondary tie by random choice. Define
the marker c by $c = (z_{i^*-1} + z_{i^*}) / 2$. Output
the marker c, the standard deviation $\sigma$,
and the difference value $\delta_{i^*}$.


ADDITIONAL MARKER 

This section covers how one additional marker
is selected, assuming that a certain collection
of markers is already at hand. The procedure is
invoked if the sets A' and B', derived from the
sets A and B via the markers obtained so far, do
not satisfy the clash condition and thus cannot
be fully separated.

Critical Interval 

The markers on hand define intervals of the
rational line for each index $j \in J$, and these
markers produce a transformation of A and B to
A' and B'. Define such an interval to be critical
if a properly chosen subdivision can lead to a
transformation of A and B to, say, A'' and B''
such that A'' and B'' have more clashing pairs
of records than A' and B'. Clearly, each critical
interval is associated with a particular attribute
$j \in J$, and all critical intervals are readily
determined via the nonclashing pairs of records
of A' and B'. We omit the obvious process. For
each critical interval, we compute an additional
marker using a method virtually identical to
Algorithm INITIAL MARKER. Specifically,
the input sets A and B of the algorithm are now
the subsets $\bar{A} \subseteq A$ and $\bar{B} \subseteq B$ of records for which
the value of the associated attribute $j \in J$ falls
into the critical interval.

The algorithm either outputs a marker together
with the associated standard deviation $\sigma$ and the
difference value $\delta_{i^*}$, or it declares that a marker
cannot be found. In the latter case, we do not
delete any attribute values from A and B, but
instead record that the interval cannot be refined,
and thus exclude it from further consideration.
When all critical intervals have been processed,
two cases are possible. Either we have at least
one additional marker, or no additional markers
could be determined. In the latter case, the
transformation process outputs A', B', the markers
on hand, and the warning message “A and B
cannot be fully separated,” and then stops.

If at least one additional marker has been 
determined, we select one of them and proceed 
recursively as described previously. The 
selection of the marker is based on a measure that 
considers the attractiveness of pattern change at 
the point of the marker, and on the number of 
nonclashing pairs of records of A' and B' that 
determine the interval to be critical. The latter 
number is called the relevance count. We first 
discuss the attractiveness of the pattern change. 

Attractiveness of Pattern Change 

The attractiveness of a pattern change is based
on a lower bound $\epsilon$ on the difference values $\delta_i$
of (15) for certain label subsequences. Each
such subsequence has, for some $n \ge 1$ yet to be
specified, $k \ge n + 1$ As followed by $l \ge n + 1$
Bs, and $\delta_i$ is the difference value produced by
the last A and the first B of the sequence. We
establish a lower bound $\epsilon$ for $\delta_i$.

Theorem 2. Let a label sequence be given for
which the original rational numbers $z_i$ are all
distinct. For some $n \ge 1$, let a label subsequence
have $k \ge n + 1$ As followed by $l \ge n + 1$ Bs. Then
$\epsilon = \beta_0 - 2\beta_{n+1}$ is a lower bound for $\delta_i$ of (15).


Proof: Since $\delta_i \ge 0$, the claim is trivial if $\beta_0 -
2\beta_{n+1} \le 0$. Hence, we suppose that $\beta_0 > 2\beta_{n+1}$.
Using the formula (7) for $\tilde{v}_i$ in the definition of $\delta_i$
of (15), we have equation (18).

Consider $\delta_i$ produced by the last A and first B of
the label sequence. Due to the $k \ge n + 1$ As (resp.
$l \ge n + 1$ Bs) in the label subsequence, we have
$v_{i-1} = v_{i-2} = \cdots = v_{i-n-1} = 1$ (resp. $v_i = v_{i+1} = \cdots =
v_{i+n} = 0$). We use these class values in (18) and
simplify to get:

$$\delta_i = \Big| \beta_{n+1} - \beta_0 + \sum_{g=n+1}^{\infty} (\beta_g - \beta_{g+1})(v_{i+g} - v_{i-g-1}) \Big| \tag{19}$$

Since $\beta_0 > 2\beta_{n+1}$ and, for all $g \ge 0$, $\beta_g \ge \beta_{g+1}$, the
right hand side of (19) is minimum if, for all g
$\ge n + 1$, we have $v_{i+g} = 1$ and $v_{i-g-1} = 0$. For that
case, $\delta_i$ becomes $\delta_i = |2\beta_{n+1} - \beta_0| = \beta_0 - 2\beta_{n+1} =
\epsilon$.

If n is sufficiently large, then the label subsequence
of Theorem 2 is quite unlikely to be a random
occurrence. Thus, if the label subsequence does
occur, we estimate that it corresponds to an
abrupt pattern change of the first kind. Indeed,
as was discussed previously, as n grows beyond
$\sigma$, this conclusion tends to become valid. For
example, $n = \lfloor 1.5\sigma \rfloor$ is large enough for the
desired conclusion, and we choose this value of
n to compute the lower bound $\epsilon$. Thus,

$$\epsilon = \beta_0 - 2\beta_{\lfloor 1.5\sigma \rfloor + 1} \tag{20}$$


$$\begin{aligned}
\delta_i &= \Big| \Big[ \beta_0 v_i + \sum_{g=1}^{\infty} \beta_g (v_{i+g} + v_{i-g}) \Big] - \Big[ \beta_0 v_{i-1} + \sum_{g=1}^{\infty} \beta_g (v_{i+g-1} + v_{i-g-1}) \Big] \Big| \\
&= \Big| \sum_{g=0}^{\infty} (\beta_g - \beta_{g+1})(v_{i+g} - v_{i-g-1}) \Big|
\end{aligned} \tag{18}$$

Let c be a marker, and define $\delta_{i^*}$ to be the change
of smoothed class values corresponding to the
marker c. To measure how likely the marker c
corresponds to a pattern change of the first kind,
we compare $\delta_{i^*}$ with $\epsilon$. Specifically, if the ratio:

$$\delta_{i^*} / \epsilon = \delta_{i^*} / (\beta_0 - 2\beta_{\lfloor 1.5\sigma \rfloor + 1}) \tag{21}$$

is near or above 1, then we estimate that we
likely have a pattern change of the first kind.
Thus, the ratio $\delta_{i^*} / \epsilon$ measures the attractiveness
of the marker. We say that the marker c has
attractiveness $\delta_{i^*} / \epsilon$.


Selection of Marker 

For each critical interval for which we have
determined an additional marker, define the
potential of the marker to be the product of
the relevance count of the interval and the
attractiveness of the marker. Letting $\gamma$ and
R denote the potential and relevance count,
respectively, we have, for each marker, the
potential $\gamma$ as:

$$\gamma = R\, \delta_{i^*} / \epsilon = R\, \delta_{i^*} / (\beta_0 - 2\beta_{\lfloor 1.5\sigma \rfloor + 1}) \tag{22}$$

We select the marker with highest potential, add
that marker to the list of markers on hand, and
proceed recursively as described earlier.

For example, suppose we have two critical
intervals. For the first interval, we have $\sigma = 6$, N
= 58, $\delta_{i^*} = 0.037$, and R = 12. For $\sigma = 6$, we have
$\epsilon = \beta_0 - 2\beta_{\lfloor 1.5\sigma \rfloor + 1} = 0.035$, and the potential is $\gamma
= R\, \delta_{i^*} / \epsilon = 12 (0.037 / 0.035) = 12.7$. If the second
critical interval has a smaller potential, then we
refine the first interval. Suppose that for the first
interval we have $z_{i^*} = 17$ and $z_{i^*-1} = 14$. Then the
new marker is $c = (z_{i^*-1} + z_{i^*}) / 2 = (14 + 17) / 2 =
15.5$.

We summarize the selection process. 


ALGORITHM ADDITIONAL MARKER 

Input: List of critical intervals.

Output: Either: “No critical interval can be
refined.” Or: Additional marker for one critical
interval.

Procedure:

1. For each critical interval, do Algorithm
INITIAL MARKER where the input
sets are the subsets $\bar{A} \subseteq A$ and $\bar{B} \subseteq B$
of records for which the value of the
associated attribute $j \in J$ falls into the
critical interval. If the algorithm declares
that no marker can be determined,
remove the interval from the list of
candidates.

2. If the list of critical intervals is empty,
output “No critical interval can be
refined,” and stop.

3. For each critical interval, use the values
$\delta_{i^*}$ and $\sigma$ determined in Step 1 and the
relevance count R to compute the
potential $\gamma = R\, \delta_{i^*} / (\beta_0 - 2\beta_{\lfloor 1.5\sigma \rfloor + 1})$.

4. Select the critical interval with maximum
potential. In case of a tie, favor the
interval with larger number of $z_i$ values,
and break any secondary tie randomly.
Using $i^*$ of the associated $\delta_{i^*}$, output the
marker $c = (z_{i^*-1} + z_{i^*}) / 2$ for the selected
interval, and stop.
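
Steps 3 and 4 can be sketched as follows. This is an illustration only: the dict keys `R`, `delta`, `sigma`, `n_values`, and `marker` are hypothetical names for the quantities produced in Step 1, and the weights $\beta_g$ are computed as exact interval probabilities via `math.erf`, so the resulting $\epsilon$ may differ slightly from rounded figures.

```python
import math
import random


def beta(g, sigma):
    """Probability that a normal variable with mean 0 and standard
    deviation sigma falls into (g - 0.5, g + 0.5)."""
    def cdf(y):
        return 0.5 * (1.0 + math.erf(y / (sigma * math.sqrt(2.0))))
    return cdf(g + 0.5) - cdf(g - 0.5)


def epsilon(sigma):
    """Lower bound of (20) with n = floor(1.5 * sigma)."""
    n = math.floor(1.5 * sigma)
    return beta(0, sigma) - 2.0 * beta(n + 1, sigma)


def potential(relevance, delta, sigma):
    """Potential of (22): relevance count times attractiveness."""
    return relevance * delta / epsilon(sigma)


def select_interval(candidates, rng=random):
    """Choose the candidate with maximum potential.

    candidates: list of dicts with keys "R", "delta", "sigma",
    "n_values" (number of z values in the interval), and "marker".
    Ties go to the larger n_values, then to a random choice.
    Returns the chosen dict, or None for an empty list.
    """
    if not candidates:
        return None
    def key(c):
        return (potential(c["R"], c["delta"], c["sigma"]), c["n_values"])
    top = max(key(c) for c in candidates)
    return rng.choice([c for c in candidates if key(c) == top])
```

The marker of the returned candidate is then added to the list of markers on hand.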


CUTPOINT ALGORITHM

With algorithms INITIAL MARKER and
ADDITIONAL MARKER at hand, we are ready
to describe the entire algorithm of Cutpoint.

We begin the scheme as follows. For each $j \in J$,
we carry out algorithm INITIAL MARKER and
thus get either a marker, say $c_j$, or conclude that a
marker cannot be obtained. In the latter case, the
attribute j is deleted from all records of A and B.
For the reduced sets, which we again denote by
A and B, we select a $j \in J$ whose marker has the
largest associated difference value $\delta_{i^*}$. We apply
the transformation implied by that single marker
and thus obtain sets A' and B'.

We test if A' and B' satisfy the clash condition.
If this is not so, that is, if at least one record of
A' and one record of B' do not clash, then we
compute one additional marker with algorithm
ADDITIONAL MARKER, update the sets A'
and B' accordingly, and proceed recursively.
That is, we test if the sets A' and B' satisfy the
clash condition, and so on.

The process stops either when A' and B' satisfy 
the clash condition, or when an additional marker 
cannot be determined. In the implementation of 
the method, we also stop introducing additional 
markers for a given $j \in J$ when the number of
markers reaches a specified maximum. In the 
tests described later, that limit was set to six. 

As an option, one may also stop the refinement 
process if the introduction of an additional 


marker does not reduce the total number of 
nonclashing pairs of records. Indeed, since that 
marker had the highest potential among the 
possible choices and did not reduce the number 
of nonclashing pairs, one might conjecture that 
additional markers may not be desirable in any 
interval, and thus may terminate the refinement 
process. 

When the recursive process terminates, some
attributes $j \in J$ may not have received any marker.
For each such attribute, a marker was determined by
algorithm INITIAL MARKER in the initial part
of Cutpoint, and we now assign that marker.
Thus, each $j \in J$ now has at least one marker.
At this point, we have the desired collection of
markers. We use them for one final update of the
sets A' and B'.

We output the collection of markers and the 
associated sets A' and B'. If these sets do not 
satisfy the clash condition, we also output the 
warning message “A and B cannot be fully 
separated.” 


Table 2. Accuracy of C4.5 on Testing Data

Dataset             Cutpoint    Entropy CC    Entropy MDL
heart               78.90       77.98         77.31
australian          85.22       86.23         85.94
hepatitis           78.71       75.48         76.78
horse-colic         84.53       83.46         88.62
boston housing      80.63       77.06         83.59
Wisconsin breast    94.98       93.69         95.41
crx                 85.07       86.23         86.09
haberman            69.92       70.91         69.26
ionosphere          88.31       90.88         90.88
pima                75.13       75.26         73.56
spectf              74.11       79.77         77.86
Overall             81.41       81.54         82.30


COMPUTATIONAL RESULTS

Cutpoint has been used so far in a variety of 
projects including credit rating, video image 
analysis, and word sense disambiguation. In 
each of the numerous cases, Cutpoint has proved 
to be effective and reliable. 

We also have compared Cutpoint with the two 
entropy-based methods Entropy CC and Entropy 
MDL, described in the introduction. 

For the comparison, we selected the following 
data sets of the UC Irvine repository of machine- 
learning databases: Cleveland heart, australian, 
hepatitis, horse-colic, boston housing, Wisconsin 
breast, crx, haberman, ionosphere, pima, and 
spectf. Boston housing was run using the 
attribute median housing value with a cutoff 
point of $21,000. For both the heart and the 
crx databases, some attributes were treated as 
nominal. 

In a 5-fold cross-validation approach, we used
Cutpoint, Entropy CC, and Entropy MDL for
discretization, and finally applied four classification
schemes. The latter methods were chosen so
that classification by decision trees, naive Bayes
methods, support vector machines, and learning
logic formulas were represented. For the first three
cases we chose the version J4.8 of C4.5, the naive
Bayes method, and the SMO support vector machine
implemented by Witten and Frank (2000).
In each case, we used the default parameters for
the runs, except that we carried out 5-fold
cross-validation instead of 10-fold cross-validation.
For the fourth method, we selected the Lsquare
version of Truemper (2004). We emphasize that
we do not claim that discretization is needed
or even desired for the first three classification
methods. We do say that, if one contemplates
a discretization preprocessing step followed by
application of a classification method of one of
the four types, then one may want to consider the
results shown in Table 2.

Table 3. Accuracy of Naive Bayes on Testing Data

Dataset             Cutpoint    Entropy CC    Entropy MDL
heart               82.87       81.89         83.87
australian          85.07       85.80         85.36
hepatitis           83.23       82.58         81.29
horse-colic         82.63       82.63         82.34
boston-housing      80.03       79.63         78.64
wisconsin-breast    96.98       96.55         97.13
crx                 85.07       86.09         85.80
haberman            69.92       69.92         67.94
ionosphere          90.88       90.88         88.89
pima                76.18       73.83         75.26
spectf              77.48       73.00         72.60
Overall             82.76       82.07         81.74

Prediction Accuracy for Testing Data

For a compact representation and ease of comparison, 
we have grouped the results for each of the 
four classification methods. Tables 2-5 show the 
accuracy established by the 5-fold cross-validation 
process. The headings contain the term “Testing 
Data,” which may seem superfluous. The term 
has been added to differentiate the results from 
a second evaluation, to be discussed shortly. 
Note that the performance data should only be 
used for a comparison of Cutpoint, Entropy CC, 
and Entropy MDL, and not for evaluation of the 
four classification methods. The reason is that 
each classification method almost certainly is not 
the best method of that type. For example, there 
are better commercial decision-tree methods than 
C4.5, there are numerous ongoing developments 
concerning naive Bayes methods and support 
vector machines, and full implementation of 
Lsquare, as conceived at present, has not yet 
been accomplished. 

Table 4. Accuracy of SMO on Testing Data 

Dataset             Cutpoint   Entropy CC   Entropy MDL
heart                  82.89        82.94         80.92
australian             85.80        85.94         85.22
hepatitis              81.29        80.00         77.42
horse-colic            85.37        83.99         86.97
boston housing         79.07        80.65         80.83
Wisconsin breast       95.69        95.69         95.40
crx                    85.36        85.94         85.80
haberman               70.25        70.91         74.84
ionosphere             88.88        91.46         90.02
pima                   74.86        74.59         75.90
spectf                 76.77        79.73         80.84
Overall                82.38        82.89         83.10

From the results, we conclude the following. 
Based on the average performance, Cutpoint 
is the preferred approach for naive Bayes and 
Lsquare, while Entropy MDL is best for C4.5 and 
SMO. Furthermore, Entropy CC is dominated 
by Cutpoint and Entropy MDL. 

When one examines the performance for indi- 
vidual data sets, then the preference is not clear- 
cut. For example, for naive Bayes and Lsquare, 
Cutpoint is best for 7 of the 11 data sets. In the 
case of C4.5 and SMO, Entropy MDL is best for 
5 of the 11 data sets. Also, there are several cases 
where Entropy CC is better than Cutpoint and 
Entropy MDL. Thus, if one needs highest possible 
accuracy for a given situation, and if sufficient 
time and data are available to estimate accuracy, 
then one should try all three discretization schemes 
and select the one performing best. 

Table 5. Accuracy of Lsqcc on Testing Data 

Dataset             Cutpoint   Entropy CC   Entropy MDL
heart                  83.25        78.25         81.22
australian             86.67        85.65         83.77
hepatitis              80.65        80.00         76.77
horse-colic            85.62        87.00         88.61
boston housing         83.80        83.99         82.60
Wisconsin breast       95.83        95.69         95.83
crx                    85.51        84.78         84.20
haberman               66.65        68.29             *
ionosphere             90.90        91.46         92.88
pima                   73.16        76.16         72.12
spectf                 82.37        79.35         80.12
Overall                83.13        82.78         82.40

* Entropy MDL failed to produce separable sets. For the computation of average performance, the value 68.29 for Entropy 
CC was used. 


Table 6. Accuracy of C4.5 on Training Data 

Dataset             Cutpoint   Entropy CC   Entropy MDL
heart                  87.05        88.70         88.21
australian             90.80        90.73         89.06
hepatitis              88.87        89.68         90.48
horse-colic            89.07        89.27         90.90
boston housing         88.78        89.82         90.32
Wisconsin breast       96.71        97.03         96.89
crx                    89.82        90.07         88.98
haberman               75.98        77.04         75.00
ionosphere             94.59        96.08         95.65
pima                   81.32        82.36         81.12
spectf                 92.51        92.88         92.50
Overall                88.68        89.42         89.01

Table 7. Accuracy of Naive Bayes on Training Data 

Dataset             Cutpoint   Entropy CC   Entropy MDL
heart                  85.48        85.56         84.57
australian             87.79        87.90         86.74
hepatitis              87.74        88.55         88.23
horse-colic            83.83        84.03         84.51
boston housing         81.87        82.71         81.67
Wisconsin breast       97.28        97.46         97.50
crx                    87.18        87.93         87.32
haberman               79.00        79.57         74.67
ionosphere             92.73        93.02         92.17
pima                   78.97        80.73         77.83
spectf                 82.21        82.57         81.64
Overall                85.83        86.37         85.17


Table 8. Accuracy of SMO on Training Data 

Dataset             Cutpoint   Entropy CC   Entropy MDL
heart                  88.86        88.04         87.63
australian             89.24        90.11         85.87
hepatitis              92.90        94.19         91.77
horse-colic            90.69        90.83         90.83
boston housing         93.03        94.76         91.05
Wisconsin breast       97.71        98.07         97.60
crx                    89.06        89.56         88.48
haberman               77.37        78.84         74.84
ionosphere             97.29        98.29         99.64
pima                   80.99        83.10         78.26
spectf                 95.60        94.94         93.16
Overall                90.25        90.98         89.01

Prediction Accuracy for Training Data 

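In some applications, one not only desires high 
accuracy when records of $\mathcal{A} - A$ and $\mathcal{B} - B$, 
that is, records of the two populations that lie outside 
the training sets, are to be classified, but also wants perfect 
or near-perfect accuracy for the training sets A and B. 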
For example, if a diagnostic system for certain 
diseases is trained on sets A and B, then a 
physician may demand that, at a minimum, the 
system handles all training cases correctly. As 
a second example, consider the situation where 
the training data are obtained via a simulation 
process, and where the classification method is 
to produce compact rules that replace a complex 
decision mechanism of the simulation process. 
The user of the classification rules then may 
demand that all training data produced from 
simulation runs must be correctly classified. 

One may evaluate the performance of Cutpoint, 
Entropy CC, and Entropy MDL and of the 
four classification methods on training data 
as follows. In each case of the 5-fold cross-validation 
process, 80% of the given data sets A 
and B are used for training, while 20% are used 
for testing. One applies the classification rules 
derived from the training sets to these very same 
sets, and thus obtains for each pair of sets A and 
B five accuracy estimates. The average of these 
five figures is an estimate of the accuracy of the 
classification method for training data. Tables 
6-9 contain these results in the format of Tables 
2-5. 

Table 9. Accuracy of Lsqcc on Training Data 

Dataset             Cutpoint   Entropy CC   Entropy MDL
heart                 100.00       100.00         99.34
australian             99.89        99.78         96.60
hepatitis              99.68        99.68         99.19
horse-colic            99.66        99.73         98.91
boston housing        100.00        99.85         96.25
Wisconsin breast      100.00       100.00         99.57
crx                    99.86        99.82         98.77
haberman               89.62        85.54             *
ionosphere            100.00       100.00        100.00
pima                  100.00        99.77         83.73
spectf                100.00       100.00         98.88
Overall                98.97        98.56         96.07

* Entropy MDL failed to produce separable sets. For the computation of average performance, the value 85.54 for Entropy 
CC was used. 

A first conclusion from Tables 6-9 is that three 
of the four methods fail to achieve perfect or 
near-perfect accuracy for the training data. The 
exception is Lsquare. Indeed, according to Table 
9, the average accuracy for the Cutpoint/Lsquare 
combination is 98.97%. That figure includes the 
comparatively poor accuracy of 89.62% of the 
haberman data set that, with only three attributes 
and 306 records, defies full separation. When that 
set is excluded from consideration, the average 
accuracy of Cutpoint/Lsquare is 99.91%. 
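
The two averages quoted for Cutpoint/Lsquare can be rechecked directly from the Cutpoint column of Table 9. The short Python sketch below is only an illustration; the dictionary and the helper name mean_accuracy are ours and are not part of the original study.

```python
# Cutpoint/Lsquare training accuracies transcribed from Table 9.
cutpoint_lsquare = {
    "heart": 100.00, "australian": 99.89, "hepatitis": 99.68,
    "horse-colic": 99.66, "boston housing": 100.00,
    "Wisconsin breast": 100.00, "crx": 99.86, "haberman": 89.62,
    "ionosphere": 100.00, "pima": 100.00, "spectf": 100.00,
}


def mean_accuracy(table, exclude=()):
    """Average accuracy over all data sets whose names are not in `exclude`."""
    values = [acc for name, acc in table.items() if name not in exclude]
    return sum(values) / len(values)


overall = mean_accuracy(cutpoint_lsquare)
without_haberman = mean_accuracy(cutpoint_lsquare, exclude=("haberman",))
print(f"all data sets: {overall:.2f}%, without haberman: {without_haberman:.2f}%")
```

Rounded to two decimals, the two values agree with the figures 98.97% and 99.91% stated in the text.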

For the other three methods, the highest 
accuracy figures are as follows: 89.42% for 
C4.5, 86.37% for naive Bayes, and 90.98% 
for SMO. Curiously, each of these numbers is 
produced with discretization by Entropy CC. 
When one removes the results for the haberman 
data set and selects the highest accuracy among 
the three discretization methods, one gets the 
following figures: 90.66% for C4.5, 87.05% for 
naive Bayes, and 92.19% for SMO. 

One may argue, quite properly so, that 
classification methods such as C4.5, naive 
Bayes, and SMO are designed to give highest 
accuracy for testing data, and that they are not 
optimized for accuracy on training data. Thus, 
it would be interesting to see how the methods 
may be modified so that simultaneously perfect 
or near-perfect accuracy for training data and 
high accuracy for testing data are achieved. 
Some condition needs to be imposed on the 
construction to rule out trivial modifications such 
as making the training data part of the output 
of the classification method, and enlarging the 
classification rule so that one first checks if the 
record is part of the training set. For example, 
one may impose the restriction that the size of 
the encoding of the entire classification rule 
must be logarithmic in the size of the encoding 
of the training sets. 

The additional requirement of perfect or near- 
perfect accuracy for training data likely influences 
the choice of the discretization method. Thus, 
for a revised classification method, it is quite 
possible that Cutpoint or Entropy CC leads to 
better results than Entropy MDL. 

Optimized Classification 

The results implicitly assume that costs for 
obtaining attribute values of records need not 
be considered. When such costs are important, 
the entire classification approach must be 
reconsidered. For the sake of discussion, we 
assume the setting described in Truemper (2004), 
which is as follows. 

Tests $T_1, T_2, \ldots, T_N$ are available to obtain attribute 
values. In particular, when test $T_j$ is performed, 
then a specified subset of attribute values is 
obtained. Each test $T_j$ carries a certain cost. In 
the general case, a test may produce rational, 
nominal, or logic data, except that the value 
Absent is not possible. On the other hand, the 
value Unavailable is allowed. A test produces 
that value if it has been determined that the 
attribute cannot or should not be obtained. 
Classification of a record into one of the 
populations A and B is done as follows. Initially, 
some entries are given, and the remaining entries 
are equal to Absent. The classification method 
recursively decides if it should declare the 
record to be in A or B, or if it should carry out 
one of the tests to get additional entries. In the 
first case, the method stops with the declaration 
that the record is classified into A or B. In the 
second case, the method selects a test, requests 
that the test be carried out, adds the new values 
to the record, and invokes recursion. 
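
The recursive process just described can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the procedure of Truemper (2004): the names classify_with_tests, toy_decide, and toy_tests are hypothetical, the decision rule is supplied by the caller, and the sketch assumes that the rule eventually returns a class label.

```python
ABSENT = None  # marker for an entry that has not been obtained yet


def classify_with_tests(record, decide, tests):
    """Classify `record`, carrying out tests only when the rule asks for them.

    decide(record) returns "A", "B", or the name of the test to run next.
    tests maps a test name to (cost, run), where run(record) returns a dict
    with the attribute values produced by that test.
    Returns (label, total cost of the tests performed).
    """
    choice = decide(record)
    if choice in ("A", "B"):              # first case: stop with a declaration
        return choice, 0.0
    cost, run = tests[choice]             # second case: carry out the test,
    updated = {**record, **run(record)}   # add the new values to the record,
    label, later_cost = classify_with_tests(updated, decide, tests)  # recurse
    return label, cost + later_cost


# Hypothetical toy setting: one attribute, one test with cost 5.
def toy_decide(record):
    if record["temp"] is ABSENT:
        return "T1"
    return "A" if record["temp"] > 38.0 else "B"


toy_tests = {"T1": (5.0, lambda record: {"temp": 39.2})}
label, cost = classify_with_tests({"temp": ABSENT}, toy_decide, toy_tests)
```

In this toy run the record starts with an Absent entry, so one test is requested before the record is declared to be in A.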

The goal is classification with an accuracy that 
is above a given lower bound, and that, subject 
to that condition, involves minimum or 
close-to-minimum total cost of tests. Depending on 
the setting, one may also impose the additional 
condition of perfect or near-perfect accuracy on 
the training data. We call any scheme that carries 
out this recursive process and that achieves the 
desired goal an optimized classification process. 
At this time, it is largely open how the various 
classification methods in existence should be 
modified so that they can carry out optimized 
classification. For Lsquare, the solution via so- 
called optimized formulas is given in Truemper 
(2004). We omit details here, but cover the 
adjustment needed for Cutpoint. Instead of 
enforcing at least one marker per attribute, we 
now require at least k markers, where k is a small 
positive integer. Then Lsquare is applied with the 
so-called optimized formula option. Typically, 1 
< k < 6 is appropriate. The specific choice may be 
obtained by trying several values and selecting 
one so that the classification of testing data is 
sufficiently accurate and can be done at low total 
test costs. If such a trial-and-error process is not 
possible, k = 2 or k = 3 is likely to work well. 
Regardless of the choice of k, the method always 
achieves perfect accuracy on the training data, 
assuming that Cutpoint produces fully separable 
sets of logic data. 

CONCLUSION 

The chapter introduces the Cutpoint method for 
discretization of rational data and compares the 
scheme with two entropy-based methods called 
Entropy CC and Entropy MDL. According to 
tests, Cutpoint seems best when the classification 
is done by naive Bayes methods or learning 
logic methods, while Entropy MDL appears to 
be best for decision tree methods and support 
vector machines. The performance differences 
are fairly small so that, for specific cases, one 
may want to apply each of the three methods and 
select the one giving best results. The chapter 
also discusses the choice of discretization 
methods when perfect or near-perfect accuracy 
is desired for training data, or when costs of 
obtaining data are to be considered in so-called 
optimized classification processes. 

ABSTRACT 

Traditional classification approaches consider a dataset formed by an archive of observations classified 
as positive or negative according to a binary classification rule. In this chapter, we consider the 
financial timing decision problem, which is the problem of deciding the time when it is profitable for the 
investor to buy shares or to sell shares or to wait in the stock exchange market. The decision is based 
on classifying a dataset of observations, represented by a vector containing the values of some financial 
numerical attributes, according to a ternary classification rule. We propose a new technique based on 
partially defined vector Boolean functions. We test our technique on different time series of the Mibtel 
stock exchange market in Italy, and we show that it provides a high classification accuracy, as well as 
wide applicability for other classification problems where a classification in three or more classes is 
needed. 


INTRODUCTION 

In the area of knowledge-based expert systems, 
the aim is to detect structural information from 
large datasets in order to extract salient features for 
identifying differences that separate one set of data 
from another. Classification methods developed in 
the literature try to classify the given observations 
and, in addition, to classify new observations in 
a way consistent with past classifications. Such 
structural information can provide powerful 
means for the solution of a variety of problems, 
including classification, automated knowledge 
acquisition for expert systems, development of 
pattern-based decision support systems, detection 
of inconsistencies in databases, feature selection, 
medical diagnosis, marketing, and numerous 
aspects of etiology. 

Several approaches coming from different 
fields have been proposed in the literature to tackle 
the classification problem. One of the best-known 
methods is support vector machines (SVM) that 
has proved highly successful in a number of clas- 
sification studies. Although the subject traces its 
origin to the seminal work of Vapnik and Lerner 
(1963), it is only now receiving growing attention. 
In the simplest case, given a set of observations 
classified into two classes, the aim is to construct 
a function to discriminate between classes. This 
can be done via a mathematical programming 
approach. A linear programming-based ap- 
proach, stemming from the multisurface method 
of Mangasarian (1965, 1968), has been used for 
a breast cancer diagnosis system (Mangasarian, 
Setiono, & Wolberg, 1990; Mangasarian, Street, 
& Wolberg, 1995; Wolberg & Mangasarian, 1990). 
Another approach is the quadratic programming 
method based on Vapnik’s Statistical Learning 
Theory (Cortes & Vapnik, 1995; Vapnik, 1995). 
See Burges (1998) for a tutorial on classification 
via SVMs. Bredensteiner and Bennett (1999) show 
how the linear programming and quadratic pro- 
gramming methods can be combined to yield two 
new approaches for the multiclass problem. Other 
mathematical programming techniques, based on 
the minimization of some function measuring the 
classification error (Freed & Glover, 1986; Glover, 
1990; Kamath, Karmarkar, Ramakrishnan, & 
Resende, 1992; Triantaphyllou, Allen, Soyster, 
& Kumara, 1994), have been used in classification 
problems. A MINSAT approach for learning 
logic relationships that correctly classify a given 
dataset has been recently proposed in Felici and 
Truemper (2000). 

Decision trees are another popular technique 
for classification. The main reason behind their 
popularity seems to be their relative advantage 
in terms of interpretability. There are several ef- 
ficient and simple implementations of decision 
trees (Quinlan, 1993). In a recent work, Street 
(2004) presents an algorithm based on nonlinear 
programming for multicategory decision trees. 
Unfortunately, one of the limitations of most deci- 
sion trees is that they are known to be unstable, 
especially when dealing with large data sets (Fu, 
Golden, Lele, Raghavan, & Wasil, 2003). In the 
literature, there are several papers that provide 
heuristics and metaheuristics for the problem of 
finding an optimal decision tree, which is known 
to be an NP-complete problem (Fu, et al., 2003; 
Niimi & Tazaki, 2000). 

The naive Bayes method is another simple but effective 
classifier (Jefferys & Berger, 1992; Yeung, 
1993). The attributes, observed in the training 
set, are assumed to be conditionally independent, 
given the value of the class attribute. In order to 
derive a good classification rule, and considering 
the independence assumption made, the marginal 
probabilities of each attribute must be estimated. 
Lin (2002) shows that the asymptotic target 
of support vector machines is a classification 
function that is directly related 
to the Bayes rule. In practice, the independence 
assumption is often unrealistic; Bayesian networks 
have therefore been introduced to model explicitly the 
dependencies between attributes (Pernkopf, 2005). 
Given a set of observations, the problem is then 
to find a network that best matches the training 
set. The search for the best network is based on a 
scoring function that evaluates each network with 
respect to the training data (Heckerman, Geiger, 
& Chickering, 1995; Lam & Bacchus, 1994). 

Several classification problems can also be 
formulated as an artificial neural network problem. 
An artificial neural network (ANN) can be thought 
of as a mathematical paradigm that models the 
biological neural system. The theory and the design 
of ANNs have developed significantly during the 
last 20 years. The increasing interest in ANNs 
is mainly due to their ability to learn both from 
supervised and unsupervised datasets. ANNs are 
very well suited for solving large classification 
problems (Archer & Wang, 1993; Bishop, 2004; 
Boulle, Chandramohan, & Weller, 2001; Coakley 
& Brown, 2000). There exist several architectures 
for dealing with real applications, and different 
algorithms and methods are used for training a 
neural network. For instance, the backpropaga- 
tion algorithm (Archer & Wang, 1993; Lawrence 
& Giles, 2000; Ooyen & Nienhuis, 1992) and the 
multisurface method of Mangasarian are often 
used for the training phase. 

In this chapter, we propose a new classification 
technique, based on combinatorics, optimization, 
and partially defined vector Boolean functions, 
which is suitable when the archive of observa- 
tions is, in particular, classified in more than two 
classes. Our method is related to the logical analysis 
of data (LAD), a classification method 
proposed by Hammer (1986). LAD was first applied 
to the classification of binary datasets when 
a binary classification rule was adopted (Boros, 
Hammer, Ibaraki, Kogan, Mayoraz, & Muchnick, 
2000). This method assumes that the archive of 
observations can be naturally represented by a 
partially defined Boolean function. The goal of 
LAD is to obtain an extension of the partially 
defined Boolean function, that is, a completely 
defined Boolean function that represents the clas- 
sification of all the vectors in the sample space. 
The central concepts used by LAD are those of 
prime implicants, which are special logical con- 
junctions of literals imposed on the values of the 
attributes in the dataset. The aim is to generate 
a set of prime implicants for finding a suitable 
minimal disjunctive normal form (DNF) (Crama, 
Hammer, & Ibaraki, 1988) representation of a 
Boolean function that both describes 
the archive and correctly classifies all known 
and most new observations. Such a minimal DNF 
provides an extension of the partially defined 
Boolean function. A straightforward extension of 
the LAD technique when an archive S is classified, 
for instance, into three classes $S_1$, $S_2$, and $S_3$ can 
be obtained by first finding a minimal DNF that 
separates the class $S_1$ from $S_2 \cup S_3$, and then by 
referring to a second minimal DNF for classifying 
a new observation either in $S_2$ or in $S_3$. The 
disadvantage of this approach is that the two DNFs 
so obtained may not capture all the relationships 
between the attributes and those salient features 
that separate one set from another. 

We apply our classification technique to the 
financial timing decision problem, which is the 
problem of deciding, at each time period, if it is 
profitable for the investor to buy shares, or to sell 
shares, or to wait in the stock exchange market. 
The decision is based on finding a ternary clas- 
sification of a dataset of observations of the stock 
exchange course. Each observation in the dataset 
is represented by a vector containing the values 
of some financial numerical attributes (Murphy, 
1997) like, for instance, the relative strength index 
(RSI), the rate of change (ROC), the stochastic 
oscillators (SO), and so forth. More precisely, we 
classify as positive those observations that refer 
to periods when it is profitable for the investor to 
buy shares, negative when it is profitable to sell 
shares, and null when it is a waiting period. In 
this chapter, we develop an approach that produces 
a vector disjunctive normal form (VDNF) 
representation that allows one to classify a new 
observation directly in one of the sets $S_1$, $S_2$, or 
$S_3$. In fact, our archive S can be represented by 
a partially defined vector Boolean function 
$\psi: S \to \{0,1\}^2$ in such a way that, for a given 
observation $s \in S$, we have $s \in S_1$, $s \in S_2$, and $s \in S_3$ if and 
only if $\psi(s)=(1,1)$, $\psi(s)=(0,0)$, and $\psi(s)=(1,0)$ or 
$\psi(s)=(0,1)$, respectively. We then proceed to find 
a simple VDNF representation consistent with this 
classification, that is, to find a simple extension 
of $\psi$. Here, the simplicity requirement means that 
we want to find a VDNF as short as possible. 
The VDNF embeds the structural information of 
the dataset so that we can correctly classify new 
observations and find out indications about the 
interpretation of the financial phenomenon under 
study. Both exact and heuristic procedures are 
used for generating a simple VDNF. Our tech- 
nique has a wide applicability for classification 
problems where a classification in three or more 
classes is needed. 

We demonstrate the effectiveness of our method by 
means of numerical experiments. We use three 
time series of the Mibtel stock exchange market 
in Italy over a period of 1 year. Each time series 
reports the open, close, high, and low daily prices. 
We generate the attributes of the financial timing 
decision problem by computing four technical 
oscillators (Murphy, 1997): the Relative Strength 
Index, the Difference on Average based on 2 days 
moving average, the Rate of Change, and the Dif- 
ference on Average based on a period of 3 days. 
By means of graphical and numerical analysis, we 
divide the set of observations into three classes: 
$S_1$, the set of the daily oscillator values suggesting 
to buy shares (positive observations); $S_2$, the 
set of values suggesting to sell shares (negative 
observations); and $S_3$, the set of values indicating 
a waiting period (null observations). We show that 
our method provides a high classification accu- 
racy. We also compare our technique with some 
standard classification methods. In particular, we 
compare it with support vector machines, decision 
trees, naive Bayes, neural networks, and linear 
and quadratic discriminant analysis. 

The remainder of the chapter is organized as 
follows: Section 2 provides some notation and the 
main results of the chapter; Section 3 presents an 
example and describes the application considered 
in this chapter; Section 4 reports some experimental 
results along with the comparison analysis 
with other classification methods; and finally, in 
Section 5, conclusions and further research issues 
are discussed. In the Appendix, we provide a short 
description of the financial attributes considered 
in our application. 


NOTATION AND MAIN RESULTS 

The input information to be classified is repre- 
sented by an archive S of financial observations. 
We assume that each observation is represented 
by a d-dimensional vector containing the values 
of some financial numerical attributes (technical 
oscillators), where d is the number of attributes 
considered. We propose to have an initial classification 
of S into three classes, and we denote 
by $S_1$ the positive class, by $S_2$ the negative class, 
and by $S_3$ the null class. In financial problems, 
these classes refer to periods when it is profitable 
to buy shares, or to sell shares, or to wait in the 
stock exchange market. That is, an observation 
that refers to a particular day, and that is classified, 
for instance, in $S_1$, indicates that it is profitable to 
buy shares on that day because the same operation 
will be less profitable on the next day, that is, 
there could be an increase in 
the stocks’ prices. Hence, each daily observation 
forecasts what will happen in the stock exchange 
market. 

Our classification technique is based on a 
binary representation of the attributes. In fact, a 
binarization procedure, consisting of the transformation 
of numerical (real valued) data to 
binary (0,1) ones, can always be implemented. 
This procedure can be performed by referring 
to a single cut-point method, or to an interval 
cut-point method (see Boros, Hammer, Ibaraki, 
& Kogan, 1997, and Boros et al., 2000, for details). 
The cut-points will be chosen in a way that allows 
one to distinguish between positive, negative, and 
null observations. A set of cut-points is consistent 
if in the resulting binarized archive $S_1 \cap S_2 = \emptyset$, 
$S_1 \cap S_3 = \emptyset$, and $S_2 \cap S_3 = \emptyset$. Hence, our binarized 
archive S can be represented by a partially defined 
vector Boolean function (pdVBf) $\psi: S \to \{0,1\}^2$ 
in such a way that, for a given observation 
$s \in S$, we have $s \in S_1$, $s \in S_2$, and $s \in S_3$ if and only if 
$\psi(s)=(1,1)$, $\psi(s)=(0,0)$, and $\psi(s)=(1,0)$ or $\psi(s)=(0,1)$, 
respectively. When we introduce the largest set of 
cut-points in the binarization process, the resulting 
binarized archive S is called Master pdVBf. 
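
As an illustration of the binarization step and of the consistency condition, the following Python sketch binarizes numeric observations with single cut-points and checks that the three binarized classes remain pairwise disjoint. It is a sketch under stated assumptions: the convention "bit = 1 when the attribute value is at least the cut-point", the function names, and the toy data are ours, not taken from the chapter.

```python
from itertools import combinations


def binarize(observation, cut_points):
    """Map a numeric vector to a 0/1 tuple, one bit per (attribute, cut-point).

    cut_points is a list of (attribute_index, threshold) pairs; a bit is 1
    when the attribute value is at least the threshold (assumed convention).
    """
    return tuple(int(observation[i] >= t) for i, t in cut_points)


def is_consistent(classes, cut_points):
    """True if no two classes share a binarized vector."""
    images = [{binarize(obs, cut_points) for obs in cls} for cls in classes]
    return all(a.isdisjoint(b) for a, b in combinations(images, 2))


# Hypothetical observations with two attributes each.
s1 = [(70.0, 1.2)]    # positive
s2 = [(25.0, -0.8)]   # negative
s3 = [(50.0, 0.1)]    # null

two_cuts = [(0, 60.0), (0, 40.0)]   # separates all three classes
one_cut = [(0, 60.0)]               # merges the negative and null observations
consistent_two = is_consistent([s1, s2, s3], two_cuts)
consistent_one = is_consistent([s1, s2, s3], one_cut)
```

With the two cut-points the toy archive stays consistent, whereas the single cut-point maps the negative and the null observation to the same binary vector.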

Partially Defined Vector Boolean 
Functions and Their Extension 

A vector Boolean function is a mapping $\phi: \{0,1\}^n \to \{0,1\}^m$, 
where $x \in \{0,1\}^n$ is a Boolean vector. In 
the sequel we will consider mostly the case $m=2$. 
The extension to the general case is straightforward, 
but it requires heavier notation. Following 
the notation introduced previously, we denote 
by $S_1(\phi)$ the set of all Boolean vectors x such that 
$\phi(x)=(1,1)$, by $S_2(\phi)$ the set of all Boolean vectors x 
such that $\phi(x)=(0,0)$, and by $S_3(\phi)$ the set of all Boolean 
vectors x such that $\phi(x)=(1,0)$ or $\phi(x)=(0,1)$. A 
partially defined vector Boolean function (pdVBf) 
$\psi$ is defined by a triple of sets $(S_1, S_2, S_3)$ such that 
$S_1$ denotes a set of positive examples, $S_2$ denotes a 
set of negative examples, and $S_3$ denotes a set of 
null examples, that is, $\psi$ is defined on $S_1 \cup S_2 \cup S_3$ 
and $\psi(x)=(1,1)$ if $x \in S_1$, $\psi(x)=(0,0)$ if $x \in S_2$, and 
$\psi(x)=(1,0)$ or $(0,1)$ if $x \in S_3$. 

We call a function $\phi$ an extension of the pdVBf 
$\psi(S_1, S_2, S_3)$ if $S_1 \subseteq S_1(\phi)$, $S_2 \subseteq S_2(\phi)$, and $S_3 \subseteq S_3(\phi)$. 
Referring to our archive of observations $\psi(S_1, S_2, S_3)$, 
our goal is the determination of an extension 
$\phi$ that agrees with the classification in the archive 
S and that represents the classification of all the 
vectors in the sample space. 
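
The definition of an extension translates directly into a membership check. In the Python sketch below, a candidate vector Boolean function is an ordinary Python function on 0/1 tuples; the names is_extension, phi_ok, and phi_bad, as well as the toy sets, are illustrative assumptions.

```python
POSITIVE, NEGATIVE = (1, 1), (0, 0)
NULL_VALUES = {(1, 0), (0, 1)}


def is_extension(phi, s1, s2, s3):
    """Check S1 ⊆ S1(phi), S2 ⊆ S2(phi), and S3 ⊆ S3(phi) for the case m = 2."""
    return (all(phi(x) == POSITIVE for x in s1)
            and all(phi(x) == NEGATIVE for x in s2)
            and all(phi(x) in NULL_VALUES for x in s3))


# Hypothetical pdVBf on Boolean vectors of length 3.
s1 = {(1, 1, 0)}
s2 = {(0, 0, 1)}
s3 = {(1, 0, 0), (0, 1, 1)}


def phi_ok(x):
    """Candidate that copies the first two components of x."""
    return (x[0], x[1])


def phi_bad(x):
    """Candidate that labels every vector as positive."""
    return (1, 1)


ok = is_extension(phi_ok, s1, s2, s3)
bad = is_extension(phi_bad, s1, s2, s3)
```

Here phi_ok agrees with the archive on all four vectors, while phi_bad already fails on the negative example.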

Referring to the theory of Boolean functions, 
given a partially defined scalar Boolean 
function $\varphi: S \subseteq \{0,1\}^n \to \{0,1\}$, necessary and 
sufficient conditions for the existence of an extension 
$f: \{0,1\}^n \to \{0,1\}$ in different subclasses of 
scalar Boolean functions are provided in Boros, 
Ibaraki, and Makino (1998). We will 
denote by $f$ a scalar Boolean function and by $\phi$ a 
vector Boolean function. 

A simple extension of the arguments in Boros 
et al. (1998) provides the following necessary 
and sufficient condition for the existence of an 
extension $\phi$ of a partially defined vector Boolean 
function $\psi$ in the class of all vector Boolean 
functions. 

Fact. A partially defined vector Boolean function 
$\psi(S_1, S_2, S_3)$ has an extension in the class of all 
vector Boolean functions if and only if $S_1 \cap S_2 = \emptyset$, 
$S_1 \cap S_3 = \emptyset$, and $S_2 \cap S_3 = \emptyset$. 

Other classes of Boolean functions are of 
interest. For instance, given a scalar Boolean 
function $f: \{0,1\}^n \to \{0,1\}$, $f$ is positive if $x \le y$ 
always implies $f(x) \le f(y)$ (Boros et al., 1998). In 
the case of a vector Boolean function $\phi: \{0,1\}^n \to \{0,1\}^m$, 
$\phi$ is positive if $x \le y$ implies $\phi(x) \le \phi(y)$, 
that is, $\phi_i(x) \le \phi_i(y)$ for all $i=1,\ldots,m$. Therefore, 
a positive vector Boolean function $\phi$ has all the 
scalar Boolean functions $\phi_i$ positive. 

As in Boros et al. (1998), we give necessary 
and sufficient conditions for the existence of an 
extension $\phi$ of a pdVBf $\psi(S_1, S_2, S_3)$ in the class 
of positive vector Boolean functions: 

Theorem 1. A pdVBf $\psi(S_1, S_2, S_3)$ has an extension 
$\phi$ in the class of positive vector Boolean 
functions if and only if for all $x \in S_1$, $y \in S_2$, and 
$z \in S_3$, we have $x \not\le y$, $x \not\le z$, and $z \not\le y$. 

Proof. Necessity. Suppose that there exists a pair 
$x \le y$ with $x \in S_1$ and $y \in S_2$ (or $x \le z$ with $z \in S_3$, 
or $z \le y$); then $\phi(x) \le \phi(y)$, by positivity of $\phi$, 
contradicting $\phi(x)=(1,1)$ and $\phi(y)=(0,0)$. 

Sufficiency. Define the following vector Boolean 
function $\phi$, where $B = \{0,1\}$: 

$S_1(\phi) = \{c \in B^n \mid c \ge x \text{ for some } x \in S_1\}$ 
$S_2(\phi) = \{b \in B^n \mid b \le y \text{ for some } y \in S_2\}$ 

The function $\phi$ is positive. Moreover, $\phi$ is a 
positive extension of $\psi$. From the positivity of $\phi$, it 
follows immediately that $S_1 \subseteq S_1(\phi)$ and $S_2 \subseteq S_2(\phi)$. 
In order to complete the proof, it only remains to 
show that $S_3 \cap (S_1(\phi) \cup S_2(\phi)) = \emptyset$. Indeed, if there 

28 


Vector DNF for Datasets Classifications: Application to the Financial Timing Decision Problem 


exists z e (S ,(<()) n S 3 ), then we have z e S 3 , and 
there exists x e Sj such that x < z, contradicting 
the assumption x h z. A similar argument shows 
that S 3 fl S 2 (4>) = 0. Hence, S 3 c S 3 (c|)), and this 
completes the proof. 
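
The condition of Theorem 1 can be tested by comparing the observations pairwise. The sketch below assumes that observations are tuples of 0/1 values and that $\le$ is the componentwise order; the names `leq` and `has_positive_extension` and the sample data are hypothetical.

```python
from itertools import product


def leq(a, b):
    """Componentwise order on Boolean vectors: a <= b."""
    return all(ai <= bi for ai, bi in zip(a, b))


def has_positive_extension(s1, s2, s3):
    """Check that no x in S1 lies below a y in S2 or a z in S3,
    and that no z in S3 lies below a y in S2."""
    return (not any(leq(x, y) for x, y in product(s1, s2))
            and not any(leq(x, z) for x, z in product(s1, s3))
            and not any(leq(z, y) for z, y in product(s3, s2)))


# Hypothetical binarized observations over n = 3 Boolean variables.
S1 = {(1, 1, 0)}
S2 = {(0, 0, 1)}
S3 = {(0, 1, 1)}

print(has_positive_extension(S1, S2, S3))
# (1, 1, 0) <= (1, 1, 1), so this S3 violates the condition.
print(has_positive_extension(S1, S2, {(1, 1, 1)}))
```

The check costs one comparison per pair of observations from different classes, so it is quadratic in the size of the archive.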

In this chapter, given our archive of observations, represented as a pdVBf, we consider the problem of finding an extension $\phi$ as short as possible in the class of all vector Boolean functions. We first need to introduce some classical definitions from Boolean algebra.

The Boolean variables $x_1, x_2, \dots, x_n$ and their complements $x'_1, x'_2, \dots, x'_n$ are called literals. A term $T$ is a conjunction of literals such that at most one of $x_i$ and $x'_i$ appears for each variable. A minterm is a maximum length conjunction term, that is, it contains all variables, either normal (uncomplemented) or complemented (Schneeweiss, 1989). A disjunction of conjunctions $T$ defines a disjunctive normal form (DNF). In the case of scalar Boolean functions, a DNF defines a function, and it is well-known that every scalar Boolean function $f$ can be represented as a DNF (Schneeweiss, 1989); however, such a representation may not be unique. The goal is to find the minimal DNF representation of a scalar Boolean function $f$. To make the minimality requirement more precise, we need to introduce some further definitions.

A term $T$ is an implicant for a given scalar Boolean function $f$ if $T = 1$ implies $f = 1$. An implicant $T$ is a prime implicant for the scalar Boolean function $f$ if any term obtained by dropping a literal from it is not an implicant (Boros et al., 1997).
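
For small $n$, the two definitions can be verified by enumerating the truth table. The following sketch makes several assumptions for illustration only: a term is encoded as a dictionary mapping a 0-based variable index to the value the literal requires (1 for $x_i$, 0 for $x'_i$), and the function $f$ is a hypothetical example.

```python
from itertools import product


def term_value(term, x):
    """A term is satisfied when every literal takes its required value."""
    return all(x[i] == v for i, v in term.items())


def is_implicant(term, f, n):
    """T is an implicant of f if T = 1 implies f = 1 on all of {0,1}^n."""
    return all(f(x) for x in product((0, 1), repeat=n) if term_value(term, x))


def is_prime_implicant(term, f, n):
    """An implicant is prime if dropping any single literal breaks it."""
    if not is_implicant(term, f, n):
        return False
    for i in term:
        shorter = {j: v for j, v in term.items() if j != i}
        if is_implicant(shorter, f, n):
            return False
    return True


# Hypothetical function f = x1 OR (x2 AND x3), with 0-based indices.
def f(x):
    return x[0] | (x[1] & x[2])


print(is_implicant({0: 1, 1: 1}, f, 3))        # x1.x2 is an implicant
print(is_prime_implicant({0: 1, 1: 1}, f, 3))  # but x1 alone already is one
print(is_prime_implicant({1: 1, 2: 1}, f, 3))  # x2.x3 is prime
```

The enumeration is exponential in $n$, so this is meant to make the definitions concrete, not to replace the tabular generation of prime implicants.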

Prime implicants are the fundamental blocks for generating an extension of a given partially defined scalar Boolean function. Indeed, the minimal DNF representation of a scalar Boolean function is obtained when its terms $T$ are prime implicants (Crama & Hammer, 2002). The concepts of implicants and prime implicants can be generalized to the case of vector Boolean functions.


Definition 1. (McCluskey, 1986). Given a vector Boolean function $\phi : \{0,1\}^n \to \{0,1\}^m$, of components $\phi_1, \phi_2, \dots, \phi_m$, a term $T$ is a multiple implicant (resp. multiple prime implicant) of $\phi$ if:

1. It is either an implicant (resp. prime implicant) of one of the functions $\phi_i$,

or;

2. It is an implicant (resp. prime implicant) of one of the product (conjunction) functions $\prod_{k \in K} \phi_k$, with $K \subseteq \{1, \dots, m\}$.

Here we introduce the concept of a vector disjunctive normal form (VDNF) as a vector of $m$ components, each corresponding to a single DNF. A vector Boolean function may be represented by a VDNF. Notice that, by considering the class of positive scalar Boolean functions introduced above, it is well-known that a scalar Boolean function $f$ is positive if and only if $f$ can be represented by a DNF in which all the literals of each term are uncomplemented (Boros et al., 1998). Hence, by referring to the definition of positive vector Boolean functions, a vector Boolean function $\phi$ is positive if and only if it can be represented by a VDNF whose components are represented by DNFs in which all the literals of each term are uncomplemented. We want to find a short representation of a vector Boolean function $\phi$ in the sense described by the following theorem.

Theorem 2. (Existence of a short VDNF (McCluskey, 1986)). There exists a short representation of a vector Boolean function $\phi$, in which each component $\phi_i$ is the disjunction of multiple prime implicants of $\phi$, such that all the terms that occur only in the expression for $\phi_i$ are prime implicants of $\phi_i$; all the terms that occur in both the expressions for $\phi_i$ and $\phi_j$ with $i \ne j$ but in no other expressions are prime implicants for $\phi_i \cdot \phi_j$; and so forth.

Multiple Logic Minimization Problem

It was already noticed that a scalar Boolean function $f : \{0,1\}^n \to \{0,1\}$ may have numerous DNF representations (Schneeweiss, 1989). In many applications a short DNF representation of $f$ is preferred over a longer one (Crama & Hammer, 2002). The problem of constructing a DNF representation as short as possible is referred to as the logic minimization problem. The same problem arises for a vector Boolean function $\phi : \{0,1\}^n \to \{0,1\}^m$. That is, we search for a short VDNF representation of $\phi$, with $m = 2$ in our application. Perhaps the most obvious technique is to find a short representation of every component (DNF) of the VDNF. Unfortunately, some simple examples show that this method does not necessarily lead to a short VDNF representation (McCluskey, 1986). The problem of finding a short VDNF is usually referred to as the multiple logic minimization problem. In our work, we refer to the Quine-McCluskey approach for solving it (McCluskey, 1986). This approach is divided into two phases. In the first phase, a prime implicant table is generated. For each (binarized) observation $s$ in the Master pdVBf $\psi(S_1, S_2, S_3)$, along with its classification $\psi(s) = (\psi_1, \psi_2)$, a minterm and a multiple prime implicant for $\psi(s) = (\psi_1, \psi_2)$ are found. That is, for each observation $s$, a minterm for $\psi_1$ and $\psi_2$, and a prime implicant for $\psi_1$, $\psi_2$, and $\psi_1 \cdot \psi_2$ are generated. Each row of the prime implicant table corresponds to a multiple prime implicant, while each column corresponds to a minterm.

In the second phase, the Quine-McCluskey approach solves a set-covering problem by finding the minimum number of rows that covers all the columns. It is well-known that the set-covering problem is NP-hard (its decision version is NP-complete), and therefore, this approach to multiple logic minimization does not provide a polynomial algorithm. In order to solve the set-covering problem, a reduction of the table is performed. For this, let us first recall the following definitions from McCluskey (1986):


Definition 2. A multiple prime implicant of a vector Boolean function $\phi = (\phi_1, \phi_2, \dots, \phi_m)$ is essential for a function $\phi_i$ if there is a minterm in the representation of $\phi_i$ that is included in only one multiple prime implicant.

Definition 3. Let $\phi = (\phi_1, \phi_2, \dots, \phi_m)$ be a vector Boolean function and let $P_1, P_2, \dots, P_t$ be the corresponding set of multiple prime implicants. Then, a minterm for a function $\phi_i$ is a distinguished minterm if and only if it is included in only one conjunction of literals that is a multiple prime implicant of $\phi_i$, or of any of the products (conjunctions) of functions involving $\phi_i$.

Definition 4. A multiple prime implicant of a vector Boolean function $\phi = (\phi_1, \phi_2, \dots, \phi_m)$ is essential for a given function $\phi_i$ if and only if it includes a distinguished minterm of $\phi_i$.

In the prime implicant table, rows and columns corresponding to essential multiple prime implicants and distinguished minterms are called essential rows and distinguished columns, respectively. The reduction of the table is first performed by deleting all the essential rows and distinguished columns (McCluskey, 1986). The table so obtained can be further reduced by applying the essential reducing algorithm (ERA) described in Crama and Hammer (2002). The final reduced table $A^*$ is the matrix of the corresponding set-covering problem. Given a set-covering matrix $A$, it is proved in Crama and Hammer (2002) that the ERA algorithm applied to $A$ preserves the solutions of a set-covering problem. For further details about the algorithm for finding a short VDNF representation of a vector Boolean function $\phi$, the interested reader can refer to McCluskey (1986). However, in the next section we provide a numerical example for finding an extension of a pdVBf.
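
The two-phase idea can be sketched on a single covering table. This is a simplified illustration under stated assumptions, not the procedure of McCluskey (1986) or the ERA of Crama and Hammer (2002): the table is a hypothetical dictionary mapping each row (a prime implicant) to the set of columns (minterms) it covers, essential rows are selected first, and the remaining columns are covered with a plain greedy heuristic, which does not guarantee a minimum cover.

```python
def essential_rows(table):
    """Rows that are the only cover of some column (a distinguished column)."""
    columns = set().union(*table.values())
    essential = set()
    for c in columns:
        covering = [r for r, cols in table.items() if c in cols]
        if len(covering) == 1:
            essential.add(covering[0])
    return essential


def reduce_and_cover(table):
    """Select the essential rows, then cover what is left greedily."""
    chosen = essential_rows(table)
    covered = set().union(*(table[r] for r in chosen))
    remaining = set().union(*table.values()) - covered
    while remaining:
        # Greedy step (an assumption of this sketch): take the row that
        # covers the most uncovered columns; ties are broken by row name.
        best = max(sorted(table), key=lambda r: len(table[r] & remaining))
        chosen.add(best)
        remaining -= table[best]
    return chosen


# Hypothetical covering table: rows are prime implicants, columns minterms.
table = {
    "P1": {"m1", "m2"},
    "P2": {"m2", "m3", "m4"},
    "P3": {"m3"},
    "P4": {"m4"},
}
print(sorted(essential_rows(table)))    # only P1 covers m1
print(sorted(reduce_and_cover(table)))  # P1 plus the row covering m3 and m4
```

In the vector case, one way to reuse the same structure is to label each column with a pair (component, minterm), so that a row shared by several components covers columns in each of them.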




A NUMERICAL EXAMPLE 

In this section, we provide a numerical example for the computation of a VDNF. Let us consider the Master pdVBf given in Table 1, where $s_i$, $i = 1, \dots, 14$, is a binarized observation and $x_j$, $j = 1, 2, 3, 4$, are the cut-points.

Our aim is to find an extension of the above Master pdVBf as short as possible. By applying the first step of the Quine-McCluskey procedure, we can find an extension of the pdVBf of Table 1 by generating all the prime implicants both for $\psi_1$ and $\psi_2$ and for the function $\psi_1 \cdot \psi_2$ (see Table 2).

The resulting extension is:

$\psi_1 = P_1 \vee P_2 \vee P_3 \vee P_5 \vee P_6 \vee P_7 = (x_1 \cdot x'_2) \vee (x_2 \cdot x_4) \vee (x_1 \cdot x_4) \vee (x'_1 \cdot x_2 \cdot x_4) \vee (x'_2 \cdot x_3) \vee (x_3 \cdot x_4)$

$\psi_2 = P_4 \vee P_5 \vee P_6 \vee P_7 = x_3 \vee (x'_1 \cdot x_2 \cdot x_4) \vee (x'_2 \cdot x_3) \vee (x_3 \cdot x_4)$.

Table 1. Partially defined vector Boolean function

Table 2. Prime implicants

Prime implicants for $\psi_1$    Prime implicants for $\psi_2$    Prime implicants for $\psi_1 \cdot \psi_2$
$P_1 = (x_1 \cdot x'_2)$         $P_4 = x_3$                      $P_5 = (x'_1 \cdot x_2 \cdot x_4)$
$P_2 = (x_2 \cdot x_4)$                                           $P_6 = (x'_2 \cdot x_3)$
$P_3 = (x_1 \cdot x_4)$                                           $P_7 = (x_3 \cdot x_4)$

Table 3. The covering table

Table 4. The second covering table

This extension may not be a short one, so we apply the second step of the Quine-McCluskey procedure in order to eliminate all the redundancies. The first step is to select all the essential prime implicants (see Definitions 2-4). For $\psi_1$ they are $P_1$ and $P_6$, while for $\psi_2$ they are $P_4$ and $P_5$. We delete from Table 2 all the minterms covered by the essential prime implicants. The remaining minterms to be covered are shown in Table 4.

The prime implicant $P_2$ in Table 4 dominates the other ones for $\psi_1$, that is, it covers all the minterms covered by $P_3$, $P_5$, and $P_7$, as well as other minterms not covered by $P_3$, $P_5$, and $P_7$. It also completes, with the essential prime implicants $P_1$ and $P_6$, the covering of the component $\psi_1$ of the table.

The resulting extension is:

$\psi_1 = P_1 \vee P_2 \vee P_6 = (x_1 \cdot x'_2) \vee (x_2 \cdot x_4) \vee (x'_2 \cdot x_3)$

$\psi_2 = P_4 \vee P_5 = x_3 \vee (x'_1 \cdot x_2 \cdot x_4)$

which is shorter than the first extension found. In fact, $\psi_1$ contains three terms fewer than before, while $\psi_2$ contains two terms fewer than before. Alternatively, a short extension can be obtained by solving the following set-covering problem:

$\min \; s_1 + s_2 + s_3 + s_4$

s.t.

$s_1 + s_3 \ge 1$
$s_1 + s_3 + s_4 \ge 1$
$s_1 + s_2 \ge 1$
$s_i \in \{0,1\}$, $i = 1, 2, 3, 4$

whose optimal solution is $(1, 0, 0, 0)$, which refers to the same dominating prime implicant $P_2$.

THE APPLICATION 

In our application, we refer to three time series of the Mibtel stock exchange market for the years 1999-2001. We considered these time series because they do not present a well-defined primary trend and, for each attribute, they show a considerable presence of lateral movements, the so-called sideways.

Each observation is related to the open, close, high, and low daily prices of the Mibtel stock exchange market. We use both graphical and algorithmic tools of financial technical analysis to generate numerical attributes or technical indicators. The interpretation of these indicators allows us to classify our dataset into the three classes $S_1$, $S_2$, and $S_3$. These indicators are obtained by computing the moving averages defined on some combinations of the open, close, high, and low prices. There are several indicators that can be considered by a financial expert to predict future movements in the market, and there is no standard criterion for choosing a given subset of indicators. In the present work, we use a particular type of technical indicator called oscillators. Oscillators are particularly useful when the market does not present a well-defined trend, and they can be used to identify oversold and overbought situations. An overbought or oversold condition merely indicates that there is a high probability of a reaction in the market. These conditions suggest that there could be an opportunity to buy or sell securities. A single oscillator is not able to provide information about the market situation. In the Appendix, we describe the characteristics as well as the meaning of the oscillators we used in our application.

EXPERIMENTAL RESULTS 

This section reports on numerical experiments with the use of the VDNF technique for classification. We consider the three time series (1999-2001) as the initial classified archive. It is composed of 747 observations, where $|S_1| = 53$, $|S_2| = 66$, and $|S_3| = 379$ are the cardinalities of the three classes, respectively. In our experiments, we used as training sets both the 1999 time series only, representing 33% of the observations of the entire archive, and the 1999-2000 time series, representing 66% of the whole archive. Then, the accuracy of our method was tested on the complement of the two training sets. Since most of the observations in the archive are in the $S_3$ class, we did not use larger training sets. Otherwise, we would have obtained a test set with very few observations in one of the two classes $S_1$ and $S_2$, and this would have resulted in a bad classification accuracy. Moreover, in contrast with other classification applications, we did not estimate the effectiveness of our technique by randomly selecting a subset of the dataset as training set. This is due to our particular application. In fact, for financial applications, the aim is to predict whether the market will continue to go up or down, that is, whether it is in an oversold or in an overbought condition. Thus, the data included in a training set must represent a continuum of observations for a given time period. However, determining the appropriate time period for historical inputs is one of the biggest challenges in financial forecasting.

We present a comparison of our VDNF ap- 
proach with some standard classification methods. 
In particular, we compared our technique with 
support vector machines, decision trees, and naive 
Bayes. For this, we use the implementation of 
these techniques contained in the Weka software 
(Witten & Frank, 2005). Moreover, classification 
results provided by neural networks and linear 
and quadratic discriminant analysis are reported 
as well. These last methods were implemented in 
the Matlab 7.0 environment. All the experiments 
were performed on a PC AMD Athlon 2500+ 
GHz. More details on the implementation of the 
classification methods compared are given as 
follows: 

Support vector machines (SVM): We obtained support vector classifiers by using both a radial basis function (RBF) and a polynomial function as kernels. Furthermore, parameter-selection strategies have also been considered. That is, we searched for the appropriate value of a parameter $C$, which controls the trade-off between the classifier capacity and the training errors, and of parameters $\gamma$ and $d$ in the two kernel functions. For instance, in the case of RBF, we find the best values of the parameters $(C, \gamma)$ to be used in the classification algorithm by searching on the following two grids of possible points for $(C, \gamma)$. We first consider the grid $[1 \cdot 10^{-4}, 1 \cdot 10^{-3}, \dots, 1 \cdot 10^{4}] \times [1 \cdot 10^{-3}, 1 \cdot 10^{-2}, \dots, 1 \cdot 10^{3}]$. Let $(C_0, \gamma_0)$ be the pair associated with the best classification value. Then, we use the grid $[0.2 C_0, \dots, 8 C_0] \times [0.2 \gamma_0, \dots, 8 \gamma_0]$ to select the best pair. Similarly, for the polynomial kernel, the best classification values were obtained by searching the parameters $(C, d)$ sequentially in the two grids $[1 \cdot 10^{-4}, 1 \cdot 10^{-3}, \dots, 1 \cdot 10^{4}] \times [1, 10, 50, 100]$ and $[0.2 C_0, \dots, 8 C_0] \times [0.2 d_0, 0.4 d_0, 0.6 d_0, 0.8 d_0, 2 d_0, 3 d_0, 4 d_0]$. These grids were considered both when the training set consisted of 33% of the dataset, and when it consisted of 66% of the dataset. For the RBF, by referring to the two percentages of the training sets, the pairs that gave the best classification values were $C^* = 600$ and $\gamma^* = 6$, and $C^* = 20$ and $\gamma^* = 60$, respectively. For the polynomial kernel, the best pairs were $C^* = 600$ and $d^* = 20$, and $C^* = 4000$ and $d^* = 8$, respectively. In Table 5, we report the percentages of well-classified instances obtained by using the two kernel functions.
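
The coarse-to-fine search described above can be sketched as follows. Everything beyond the two-stage structure is an assumption of this sketch: `score` stands for a user-supplied function returning the classification accuracy of a classifier trained with the given parameters, the list of fine-grid multipliers between 0.2 and 8 is a guess (the text gives only the endpoints), and `toy_score` is an artificial surrogate used so that the example runs without any training data.

```python
import math
from itertools import product


def best_on_grid(score, c_values, g_values):
    """Return the (C, gamma) pair with the highest score on a grid."""
    return max(product(c_values, g_values), key=lambda p: score(*p))


def two_stage_search(score, coarse_c, coarse_g, multipliers):
    """Coarse grid first, then a finer grid scaled around the coarse optimum."""
    c0, g0 = best_on_grid(score, coarse_c, coarse_g)
    fine_c = [m * c0 for m in multipliers]
    fine_g = [m * g0 for m in multipliers]
    return best_on_grid(score, fine_c, fine_g)


# Coarse grids: powers of ten, as in the text.
coarse_c = [10.0 ** k for k in range(-4, 5)]
coarse_g = [10.0 ** k for k in range(-3, 4)]
# Assumed multipliers for the fine grid; only 0.2 and 8 come from the text.
multipliers = (0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8)


def toy_score(c, g):
    """Artificial accuracy surrogate peaking at C = 40, gamma = 0.2."""
    return -(abs(math.log10(c / 40.0)) + abs(math.log10(g / 0.2)))


best_c, best_g = two_stage_search(toy_score, coarse_c, coarse_g, multipliers)
print(best_c, best_g)
```

With a real classifier, `score` would train on the training set and return the accuracy on held-out data; the second stage only refines the neighborhood of the first-stage optimum, so a poor coarse grid cannot be repaired by the fine one.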

Decision trees: We referred to the popular C4.5 algorithm implemented in the Weka software for constructing decision trees. We tested a set of parameters in the range [1,...,20], indicating the minimum number of instances to specify in a leaf of the tree. For both the percentages of the training sets, the best classification was obtained with the parameter set to the value 2. Moreover, we observed that a pruning strategy did not yield better percentages of correctly classified instances than an unpruned strategy. In Table 5, we report the best values of well-classified instances found.

Naive Bayes: The popular Naive Bayes method is another simple yet effective classifier. This method learns the conditional probability of each attribute, given the class label, from the training data. Classification is then done by applying Bayes' rule to compute the probability of a class value, given the particular instance, and predicting the class value with the highest probability. A strong independence assumption is made, that is, all the attributes are assumed conditionally independent, given the value of the class attribute. Numerical values of the attributes are usually handled by assuming that they have a Gaussian probability distribution. Since this assumption may be incorrect, we also implemented the method referring to the kernel density estimation that does not assume any particular distribution for the attribute values.
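
The Gaussian variant described above can be written in a few lines. This is a minimal sketch for illustration, not the Weka implementation used in the experiments; the function names, the small variance floor, and the toy data are assumptions.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior and a mean/variance for each attribute."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest posterior (computed in logs)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_p = math.log(prior)
        # Conditional independence: the attribute log-densities simply add up.
        for v, m, s2 in zip(row, means, variances):
            log_p += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return log_p
    return max(model, key=log_posterior)


# Toy data with two well-separated classes (hypothetical values).
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [5.1, 6.0]))
```

Replacing the Gaussian log-density with a kernel density estimate per attribute and class gives the second variant mentioned in the text.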

Neural networks: For training a feedforward neural network, we used the backpropagation algorithm implemented in Matlab 7.0 with the regularized mean square error (MSE regularized) as performance function. Although there are many variants of the backpropagation algorithm in Matlab, we adopted the resilient backpropagation training algorithm, which is able to eliminate the harmful effects of the magnitudes of the partial derivatives and is generally much faster than other procedures. A validation set was also introduced in order to check the progress of training. This set was formed by three months of observations, from December 2000 to February 2001. Both with a training set formed by 33% of the observations and with one formed by 66% of the observations, we trained six single-layer and four double-layer networks with [50,100,150,200,250,300] neurons and [10,15,25,50] neurons per layer, respectively. Each network was trained five times with randomly generated input weights. In Table 5, we report only the best classification percentages obtained.

Discriminant analysis: We used the tools for the linear and quadratic discriminant analysis of Matlab 7.0 that implement Fisher's Method.

Table 5. Percentage of well-classified instances with standard classification methods

Classifier                                        33% Training    66% Training
Support Vector Machine with RBF                   95.38           96.78
Support Vector Machine with Polynomial Kernel     95.58           96.38
Decision Trees                                    94.97           95.79
Naive Bayes                                       91.76           86.34
Naive Bayes with kernel estimation                93.17           86.54
Neural Networks                                   74.83           79.52
Linear Discriminant Analysis                      82.73           70.60
Quadratic Discriminant Analysis                   52.81           75.10

We now consider the VDNF approach. Given the archive of observations, we derive three different binarized tables, each referring to one of three different binarization strategies. Namely, the three binarized archives are obtained by first applying only a single cut-point method, then only an interval cut-point method, and finally, by using both a single cut-point and an interval cut-point method (Boros et al., 1997; Boros et al., 2000). We report the classification accuracy of our technique by referring to each Master pdVBf obtained. In general, regardless of the method used, a binary encoding of a dataset of observations generates a great number of Boolean variables (cut-points). In fact, many of the Boolean variables introduced by a binarization procedure may not be needed to explain the phenomenon. Hence, a size reduction is actually necessary in order to prevent insurmountable computational difficulties at the VDNF generation stage. Following Boros et al. (2000), we reduce the dimension of the Master pdVBf by deleting the redundant variables. In order to generate the VDNF with the Quine-McCluskey procedure, we used two software packages:

1. ESPRESSO II: It generates the prime implicant table, as described before, and then applies a reduction process to this table. Finally, depending on the dimension of the resulting reduced table, either an exact algorithm or a greedy-like procedure is implemented for solving the set-covering problem.

2. BOOM v2.3: It implements a stochastic greedy-like procedure for generating the prime implicant table. After applying a reduction process to this table, a stochastic greedy-like procedure is implemented for solving the resulting set-covering problem.

In Tables 6 and 7, we report the results when the training set consists of 33% of the dataset and of 66% of the dataset, respectively. For each software used, we present the Master pdVBf dimension (i.e., number of observations times number of cut-points used) and the percentage of the well-classified points in the test set.

Table 6. Classification accuracy with 33% training

               Single Cut Point           Interval Cut-Point         Single-Interval Cut Point
               Dimension    Classified    Dimension    Classified    Dimension    Classified
Espresso II    249x155      94.98         249x146      88.55         249x288      84.54
Boom 2.3       249x155      95.18         249x146      94.77         249x288      95.18

Table 7. Classification accuracy with 66% training

               Single Cut Point           Interval Cut-Point         Single-Interval Cut Point
               Dimension    Classified    Dimension    Classified    Dimension    Classified
Espresso II    498x191      98.99         498x181      98.99         498x346      95.58
Boom 2.3       498x191      98.99         498x181      98.99         498x346      97.98

Recalling the definition of the three classes $S_1$, $S_2$, and $S_3$, we observed that the percentage of well-classified observations in the test set represents how many times we were able to correctly classify the observations in $S_1$, $S_2$, and $S_3$ and, therefore, how many times we were able to make a profitable operation (to buy, to sell, or to wait) in the stock exchange market.

At the beginning of this section we pointed out that, from an application point of view, in financial forecasting problems it makes no sense to apply a classification method by randomly selecting a subset of the dataset as training set. Nevertheless, we next report some experimental results on randomly generated training sets with the aim of evaluating the efficiency of our technique.

Let us consider the 1999 time series only as dataset. It has 249 observations, where $|S_1| = 34$, $|S_2| = 27$, and $|S_3| = 188$ are the cardinalities of the three classes in which the dataset is subdivided, respectively. We tested the classification accuracy of our method by extracting from the 1999 series two different samples that were used as training sets, each containing about 47% of the observations in the archive (119 observations). The observations in the samples were chosen according to this rule: (1) for each sample, the numbers $r_1$, $r_2$, and $r_3$ of the observations that must be classified in each class were decided first; (2) from the 1999 time series we extract, at random, $r_1$ units from the observations in $S_1$, $r_2$ units from the observations in $S_2$, and $r_3$ units from the observations in $S_3$. We chose this rule since extracting 119 observations at random directly from the 1999 series may result in a sample in which one of the $r_i$, $i = 1, 2, 3$, is zero. Also, a training set (a sample) with very few observations in one class may give a bad classification accuracy.

In our application, for the first sample we considered $r_1 = 19$, $r_2 = 12$, and $r_3 = 88$, while for the second sample we considered $r_1 = 19$, $r_2 = 15$, and $r_3 = 85$. As test set, we used the rest of the 1999 time series, composed of 130 observations. The results are reported in Table 8. For each sample, we provide the dimension of the Master pdVBf generated, and the percentages of the well-classified observations in the test set for both software packages. We used only the simple cut-point method for the binarization of the numerical dataset.

Table 8. Dataset composed by the 1999 time series

                 Master pdVBf dimension    Well-Classified
                                           Espresso II    Boom 2.3
First sample     119x101                   90.77          91.54
Second sample    119x112                   86.15          92.31
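
The sampling rule used for these two training sets, drawing a fixed number of observations from each class, can be sketched as follows. The function name, the use of integer placeholders instead of real observations, and the fixed random seed are assumptions made for illustration; the class sizes and the values of $r_1$, $r_2$, $r_3$ are those of the first sample.

```python
import random


def stratified_sample(classes, sizes, seed=0):
    """Draw sizes[label] observations at random from each class;
    everything not drawn goes to the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for label, observations in classes.items():
        chosen = set(rng.sample(range(len(observations)), sizes[label]))
        for i, obs in enumerate(observations):
            (train if i in chosen else test).append((obs, label))
    return train, test


# Placeholder observations: only the class sizes come from the text.
classes = {
    "S1": list(range(34)),
    "S2": list(range(27)),
    "S3": list(range(188)),
}
sizes = {"S1": 19, "S2": 12, "S3": 88}  # r_1, r_2, r_3 of the first sample

train, test = stratified_sample(classes, sizes)
print(len(train), len(test))  # 119 130
```

Fixing the per-class counts in advance guarantees that every class is represented in the training set, which a plain random draw of 119 observations does not.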

CONCLUSION AND FURTHER RESEARCH

In this chapter, we proposed a new classification technique that is based on combinatorics, optimization, and partially defined vector Boolean functions. Our technique is concerned with classification problems, where the goal is to extract salient features from an archive of observations in order to separate one set of observations from another. We developed a method that can be efficiently used when the observations in the archive are divided into three (or more) classes according to a ternary classification rule. In particular, in this chapter, we applied our classification method to a financial problem, the financial timing decision problem. The classification performance of our technique was tested on three financial time series. We compared our technique with some standard classification approaches. In particular, we compared it with support vector machines, decision trees, naive Bayes, neural networks, and linear and quadratic discriminant analysis, obtaining encouraging results. Our technique seems to provide a high classification accuracy, as well as a wide applicability for classification problems where a classification in three or more classes is needed. In fact, it outperforms almost all the standard classification methods, and compares favorably with the SVM classifiers that have proved highly successful in a number of classification studies. Moreover, it reveals a good explanatory power of the phenomenon under study, since a VDNF, which makes use of multiple prime implicants, better captures the combination between the attributes considered.

Of course, we are aware that further research is needed for a better understanding of the mathematical and computational aspects of this technique, and that further classification problems need to be considered in order to better understand the domain of applicability of our method.

APPENDIX: THE TECHNICAL OSCILLATORS

Relative Strength Index (RSI)

The relative strength index is a well-known oscillator developed by Welles Wilder Jr. RSI measures the relative changes between high and low closing prices, and provides insight into overbought and oversold conditions. The term relative strength generally implies a comparison between two different markets or indices. RSI provides early “warning signals” (buy or sell signals) if it is used in conjunction with other indicators. The relative strength index values can be plotted on a vertical scale ranging from 0 to 100. The 70 and 30 values are referred to as warning signals. An RSI value above 70 is related to an overbought condition, indicating a (probable) selling period, while a value below 30 refers to an oversold condition, indicating a (probable) buying period. The values 80 and 20 are preferred by some traders. The information provided by the RSI depends upon the time interval on which it is computed. The shorter the interval, the more sensitive is the information provided by the index. Time intervals of 9, 10, and 25 days are often considered. Extending the time period makes the oscillator smoother and narrower in amplitude. RSI signals should always be used in conjunction with trend-reversal indicator prices.
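
The text does not state the formula; a common way to compute the index is $RSI = 100 - 100 / (1 + RS)$, where $RS$ is the ratio between the average gain and the average loss of the closing price over the chosen interval. The sketch below follows that formula under stated assumptions: it uses simple averages over the last `period` price changes (Wilder's original smoothing is not reproduced), and the function name and sample prices are hypothetical.

```python
def rsi(closes, period=9):
    """Relative strength index over the last `period` closing-price changes."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    # Day-to-day changes over the most recent `period` days.
    changes = [b - a for a, b in zip(closes[-period - 1:-1], closes[-period:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # no losing day in the window
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)


rising = [float(p) for p in range(100, 110)]   # ten rising closes
print(rsi(rising))                             # 100.0, far above the 70 line
print(rsi([10.0, 11.0, 10.0, 11.0, 10.0], period=4))  # 50.0, balanced moves
```

A value above 70 (or 80) would then be read as overbought and a value below 30 (or 20) as oversold, as described above.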

Rate of Change (ROC) 

ROC measures the “speed” of the prices in the market. Indeed, ROC is sometimes referred to as the price rate of change (PROC). Growing values of ROC indicate a bullish period of increasing prices, while falling values of ROC indicate a bearish period of decreasing prices. The ROC index displays the amount of price change over a given time period. ROC can be represented as a wave. When the wave is above an equilibrium line, usually the zero-line, we assume that we are in a buying period. When the wave falls below the equilibrium line, we assume that we are in a selling period. When the wave starts growing from below the equilibrium line, we have an indication of a coming bullish period. The symmetric configuration is considered to indicate a forthcoming bearish period. Like the RSI, the ROC can also be computed referring to different time periods. If ROC is computed referring to a 10- or 12-day interval, it is a good short-term price indicator.
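
One common definition of the index, assumed here since the text does not give a formula, is the percentage change of the closing price with respect to the close `period` days earlier. The function name and the sample prices are hypothetical.

```python
def roc(closes, period=10):
    """Percentage change of the last close versus the close `period` days ago."""
    if len(closes) <= period:
        raise ValueError("need more than `period` closing prices")
    past = closes[-period - 1]
    return 100.0 * (closes[-1] - past) / past


closes = [100.0] * 10 + [110.0]
print(roc(closes))  # 10.0: the price is 10% above its level ten days earlier
```

With this definition the zero-line is the equilibrium line mentioned above: positive values correspond to prices above their level `period` days earlier, negative values to prices below it.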

Moving Average Convergence/Divergence Trading Method (MACD) 

The MACD method, developed by Gerald Appel, is a trend indicator, telling us whether a stock is in 
an uptrend or in a downtrend (Murphy, 1997). The direction of a long-term trend is the first assess- 
ment one should consider in any market. An uptrend is preferred and indicates a buying period, while 
a downtrend indicates a selling period. The simplest representation of this indicator is composed of two 
lines: the MACD line, which is the difference between two exponential moving averages (EMAs), and 
a signal line, which is an EMA of the MACD line itself. The signal or trigger line is plotted on top of 
the MACD to show buy or sell opportunities. Gerald Appel’s MACD method uses a 26-day and 12-day 
EMA, based on the daily close prices, and a 9-day EMA for the signal line. The basic MACD trading 
rule is to buy when the MACD rises above its signal line. Similarly, a sell signal occurs when the MACD 
crosses below its signal line. If the MACD line is above the signal line, it denotes the beginning of a 
trend. An uptrend typically stops when the MACD line falls below the signal line. 
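
The construction of the two lines and the basic trading rule can be sketched as follows, using the 12-day, 26-day, and 9-day EMAs mentioned above. The smoothing factor $2 / (\text{span} + 1)$, the seeding of each EMA with the first value of the series, the function names, and the synthetic price series are assumptions of this sketch.

```python
def ema(values, span):
    """Exponential moving average, seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(out[-1] + alpha * (v - out[-1]))
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """MACD line (fast EMA minus slow EMA) and its signal line."""
    macd_line = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    signal_line = ema(macd_line, signal)
    return macd_line, signal_line


def crossover_signals(macd_line, signal_line):
    """Buy when the MACD rises above its signal line, sell when it falls below."""
    signals = []
    for t in range(1, len(macd_line)):
        prev = macd_line[t - 1] - signal_line[t - 1]
        curr = macd_line[t] - signal_line[t]
        if prev <= 0 < curr:
            signals.append((t, "buy"))
        elif prev >= 0 > curr:
            signals.append((t, "sell"))
    return signals


# Synthetic closes: 40 falling days followed by 40 rising days.
closes = [100.0 - i for i in range(40)] + [61.0 + i for i in range(1, 41)]
signals = crossover_signals(*macd(closes))
print(signals)
```

On this series the rule produces a sell signal at the start of the decline and a buy signal shortly after prices turn upward, which is the lagging behavior expected of a trend indicator.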

ABSTRACT

The purpose of this chapter is to demonstrate the possibility of transforming a large class of machine- 
learning algorithms into commonsense reasoning processes based on using well-known deduction and 
induction logical rules. The concept of a good classification (diagnostic) test for a given set of positive 
examples lies in the basis of our approach to the machine-learning problems. The task of inferring all 
good diagnostic tests is formulated as searching the best approximations of a given classification (a 
partitioning) on a given set of examples. The lattice theory is used as a mathematical language for con- 
structing good classification tests. The algorithms of good tests inference are decomposed into subtasks 
and operations that are in accordance with main human commonsense reasoning rules. 


INTRODUCTION 

The development of a full online computer model 
for integrating deductive and inductive reason- 
ing is of great interest in machine learning. The 
main tendency of integration is to combine, into a 


whole system, some already well-known models 
of learning (inductive reasoning) and deductive 
reasoning. For instance, the idea of combining 
inductive learning from examples with prior 
knowledge and default reasoning has been ad- 
vanced in Giraud-Carrier and Martinez (1994). 
Obviously, this way leads to a lot of difficulties 
in knowledge representation because deductive 
reasoning tasks are often expressed in the classical 
first-order logic language (FOL), but 
machine-learning tasks use a variant of symbolic-valued 
attribute language (AVL). 

The principle of “aggregating” different models 
of human thinking for constructing intelligent 
computer systems leads to dividing the whole 
process into two separate modes: learning and 
execution or deductive reasoning. This division 
is used, for example, in Zakrevskij (1982, 1987; 
Zakrevskij & Vasylkova, 1997). This approach is 
based on using finite spaces of Boolean or mul- 
tivalued attributes for modeling natural subject 
areas. It combines inductive inference used for 
extracting knowledge from data with deductive 
inference (the type of theorem proving) for solv- 
ing pattern recognition problems. The inductive 
inference is reduced to looking for empty (for- 
bidden) intervals of Boolean space of attributes 
describing a given set of positive examples. The 
deductive inference relates to the situation when 
an object is contemplated with known values of 
some attributes and unknown values of some oth- 
ers, including a goal attribute. The possible values 
of the latter ones are to be calculated on the base 
of implicative regularities in the Boolean space 
of attributes. In Zakrevskij (2001), the results of 
prolonged research conducted in that direction 
at the Institute of Engineering Cybernetics in 
Minsk are given. 

The fundamental unified model for combining 
inductive reasoning with deductive reasoning is 
developed in the framework of inductive logic 
programming (ILP). ILP is a discipline that 
investigates the inductive construction of first-order 
clausal theories from examples and background 
knowledge. ILP has the same goal as machine 
learning, namely, to develop tools and techniques 
to induce hypotheses from examples and to obtain 
new knowledge from experience; but, the tradi- 
tional theoretical basis of ILP is in the framework 
of first-order predicate calculus. 


Inductive inference in ILP is based on inverting 
deductive inference rules; for example, 
inverting resolution (rules of absorption, 
identification, intraconstruction, and interconstruction) and 
inverting implication (inductive inference under 
θ-subsumption). 

There is a distinction between concept learn- 
ing and program synthesis. Concept learning and 
classification problems, in general, are inher- 
ently object oriented. It is difficult to interpret 
concepts as subsets of domain examples in the 
frameworks of ILP. One of the ways to overcome 
this difficulty has been realized in a transforma- 
tion approach: an ILP task is transformed into an 
equivalent learning task in a different representation 
formalism. This approach is realized in LINUS 
(Lavrac & Dzeroski, 1994; Lavrac, Gamberger, 
& Jovanoski, 1999), which is an ILP learner 
inducing hypotheses in the form of constrained 
deductive hierarchical database (DHDB) clauses. 
The main idea of LINUS is to transform the 
problem of learning relational DHDB descrip- 
tions into the attribute-value learning task. This 
is achieved by the so-called DHDB interface. 
The interface transforms the training examples 
from the DHDB form into the form of attribute- 
value tuples. Some well-known attribute-value 
learners can then be used to induce “if-then” 
rules. Finally, the induced rules are transformed 
back into the form of DHDB clauses. The LINUS 
uses already-known algorithms, for example, the 
decision tree induction system ASSISTANT, and 
two rule induction systems: an ancestor of AQ15 
named NEWGEM, and CN2. 

A simple form of predicate invention through 
first-order feature construction is proposed by 
Lavrac and Flach (2000). The constructed features 
are then used for propositional learning. 

Another way for combining ILP with an at- 
tribute-value learner has been developed in Lisi 
and Malerba (2004). In this work, a novel ILP 
setting is proposed. This setting adopts AL-log 
as a knowledge representation language. It al- 
lows a unified treatment of both the relational 
and structural features of data. This setting has 
been implemented in SPADA, an ILP system 
developed for mining multilevel association rules 
in spatial databases and applied to geographic 
data mining. 

AL-log is a hybrid knowledge representa- 
tion system that integrates the description logic 
ALC (Schmidt-Schauss & Smolka, 1991) and the 
deductive database language DATALOG (Ceri, 
Gottlob, & Tanca, 1990). Therefore, it embodies 
two subsystems, called structural and relational, 
respectively. 

The description logic ALC allows for the 
specification of structural knowledge in terms 
of concepts, roles, and individuals. Individuals 
represent objects in the domain of interest. Con- 
cepts represent classes of these objects, while 
roles represent binary relations between concepts. 
Complex concepts can be defined from primi- 
tive concepts and roles by applying constructors 
such as $\sqcap$ (conjunction), $\sqcup$ (disjunction), and $\neg$ 
(negation). 

ALC knowledge bases have an intensional 
part and an extensional part. In the intensional 
part, relations between concepts are syntactically 
expressed as inclusion statements of the form $C \sqsubseteq D$, 
where C and D are two arbitrary concepts. As 
for the extensional part, it is possible to specify 
instances of relations between individuals and 
concepts. Relations are expressed as membership 
assertions, for example, concept assertions of the 
form a : C (“a belongs to C”). 

The formal model of conceptual reasoning, 
based on an algebraic lattice, has been obtained 
in two independent ways. One way goes back 
to the works of the great psychologist J. Piaget, 
who introduced the concept of grouping (1959) 
to explain methods of object classification used 
mainly by 7- to 11-year-old children. 

The idea of concepts’ classification as a lattice 
arose from practical tasks of developing informa- 
tion retrieval and pattern recognition systems. In 
1974, Shreider described the classification algebra 
as an idempotent semigroup with a unit element. 


In the same year, Boldyrev (1974) advanced the 
formalization of pattern recognition system as 
algebra with two binary operations of refinement 
and generalization defined by an axiom system, 
including lattice axioms. The ideas of Boldyrev 
have been used often for minimization of Boolean 
partial functions with a large number of “Don’t 
Care” conditions, but we have been interested, 
from the beginning of our investigation, in ap- 
plying the lattice theory for feature extraction 
and classification of attribute-value’s tuples, and 
later, of concepts (symbols, names...). 

The formal concept analysis (FCA), based on 
the concept lattice, has been advanced by Wille 
(1992). The problems of the FCA have been 
extensively studied by Stumme et al. (Stumme, 
Taouil, Bastide, Pasquier, & Lakhal, 2000), Dowl- 
ing (1993), Salzberg (1991). Some algorithms for 
building concept lattices are considered in Nourine 
and Raynaud (1999), Ganter (1984), Kuznetsov 
(1993), and Kuznetsov and Obiedkov (2001). 

A lot of experience has been obtained on the 
application of algebraic lattices in machine learn- 
ing. From this point of view, the JSM-method 
of reasoning (Finn, 1984, 1988, 1991, 1999) is 
interesting. 

The JSM-method of hypotheses’ automatic 
generation formalizes a special class of plausible 
reasoning. The technique of this method is a syn- 
thesis of several cognitive procedures: empirical 
induction based on modeling John S. Mill’s joint 
rule of similarity-distinction (Mill, 1900), causal 
analogy, and Charles S. Peirce’s abduction. 

Similarity in the JSM-method is both a relation 
and an operation that is idempotent, commutative 
and associative (i.e., it induces a semilattice on 
objects’ descriptions and their generalizations). 
Being described in algebraic terms, the JSM- 
method can be implemented in the procedural 
programming languages. 

In Galitsky et al. (Galitsky, Kuznetsov, & 
Vinogradov, 2005), the system JASMINE, based 
on the JSM-method, is presented. The system 
extends this methodology by implementing (1) a 
combination of abductive, inductive, and analogical 
reasoning for hypotheses generation, and (2) 
multivalued logic-based deductive reasoning for 
verification of their consistency. Formally, all the 
above components can be represented as deduc- 
tive inference via logic programming (Anshakov, 
Finn, & Skvortsov, 1989; Finn, 1999). In fact, 
JASMINE is based on the logic programming 
implementation (Vinogradov, 1999). 

The idea of using algebraic lattices for knowl- 
edge or data representation is realized by a lot 
of researchers. We can mention some of them: 
the works of the French group (Ganascia, 1989); 
the work on conceptual clustering (Carpineto & 
Romano, 1996); the works related to conceptual 
knowledge discovery (Mephu & Njiwoua, 1998; 
Stumme, Wille, & Wille, 1998). The following 
works are devoted to the application of algebraic 
lattices for extracting functional and implicative 
dependencies from data: Demetrovics and Vu 
(1993), Mannila and Raiha (1992), Mannila and 
Raiha (1994), Huhtala et al. (Huhtala, Karkkainen, 
Porkka, & Toivonen, 1999), Cosmadakis et 
al. (Cosmadakis, Kanellakis, & Spyratos, 1986), 
Naidenova and Polegaeva (1986), Megretskaya 
(1988), Naidenova et al. (Naidenova, Polegaeva, 
& Iserlis, 1995a), Naidenova et al. (Naidenova, 
Plaksin, & Shagalov, 1995b), Naidenova (1992, 
2001). 

An advantage of the algebraic lattices approach 
is based on the fact that an algebraic lattice can 
be defined both as an algebraic structure that is 
declarative, and as a system of dual operations 
with the use of which the elements of this lattice 
and the links between them can be generated. 

Our approach to machine-learning problems 
is based on the concept of a good diagnostic 
(classification) test. We have chosen the lattice 
theory as a model for inferring good diagnostic 
tests from examples from the very beginning of 
our work in this direction. This concept has been 
advanced firstly in the framework of inferring 
functional and implicative dependencies from 
relations (Naidenova & Polegaeva, 1986). But 


later, the fact has been revealed that the task of 
inferring all good diagnostic tests for a given set 
of positive and negative examples can be formu- 
lated as the search of the best approximation of 
a given classification on a given set of examples, 
and that it is this task that some well-known 
machine-learning problems can be reduced to 
(Naidenova, 1996): finding keys and functional 
dependencies in database relations, finding 
association rules, finding implicative dependencies, 
inferring logical rules (if-then rules, rough sets, 
“ripple down” rules), decision tree construction, 
learning by discovering concept hierarchies, 
eliminating irrelevant features from the set of 
exhaustively generated features. 

In this chapter, we would like to demonstrate 
the possibility of transforming a large class of 
machine-learning algorithms for inferring good 
classification tests into the commonsense reason- 
ing processes based on using well-known logical 
reasoning rules. 

In this chapter, we describe the forms of an 
expert’s rules (rules of the first type). The rules of 
the first type can be represented with the use of 
only one class of logical rules based on implicative 
dependencies between concepts (names). Then 
we describe commonsense reasoning operations 
(deductive and inductive) or rules of the second 
type. The concept of a good diagnostic test is 
introduced, and the problem of inferring all good 
diagnostic tests for a given classification on a 
given set of examples is formulated. We give the 
description of the mathematical model underlying 
algorithms of inferring good tests from examples. 
This model allows one to demonstrate that the 
inferring good tests entails applying deductive 
and inductive commonsense reasoning rules of 
the second type. We propose a decomposition of 
learning algorithms into operations and subtasks 
with the use of which good diagnostic tests infer- 
ring is transformed into an incremental process. 
The concepts of an essential value and an essen- 
tial example are also introduced. We describe 
an incremental learning algorithm DIAGaRa 




Reducing a Class of Machine Learning Algorithms to Logical Commonsense Reasoning Operations 


and an approach to incremental inferring good 
diagnostic tests. The chapter ends with a brief 
summary section. 

THE LOGICAL REASONING RULES 

We need the following three types of rules in 
order to realize logical inference (deductive and 
inductive): 

• INSTANCES or relationships between ob- 
jects or facts really observed. Instance can 
be considered as a logical rule with the least 
degree of generalization. On the other hand, 
instances can serve as a source of a training 
set of positive and negative examples for 
inductive inference of generalized rules. 

• RULES OF THE FIRST TYPE, or logical 
rules. These rules describe regular relation- 
ships between objects and their properties 
and between properties of different objects. 
The rules of the first type can be given ex- 
plicitly by an expert, or derived automatically 
from examples with the help of some learning 
process. These rules are represented in the 
form of “if-then” assertions. 

• RULES OF THE SECOND TYPE or infer- 
ence rules with the help of which rules of 
the first type are used, updated, and inferred 
from data (instances). The rules of the second 
type embrace both inductive and deductive 
reasoning rules. 

The Rules of the First Type 

The rules of the first type can be represented with 
the use of only one class of logical statements; 
namely, the statements based on implicative de- 
pendencies between names. Names are used for 
designating concepts, things, events, situations, 
or any evidences. They can be considered as at- 
tributes’ values in the formal representations of 
logical rules. In our further consideration, the 


letters A, B, C, D, a, b, c, d, ... will be used as 
attributes’ values in logical rules. 

We consider the following rules of the first 
type: 

• Implication: $a, b, c \to d$. This rule means 
that if the values standing on the left side 
of the rule are simultaneously true, then the 
value on the right side of the rule is always 
true. 

• Interdiction or forbidden rule (a special 
case of implication): $a, b, c \to \mathit{false}$ (never). 
This rule interdicts a combination of values 
enumerated on the left side of the rule. The 
rule of interdiction can be transformed into 
several implications such as $a, b \to \text{not } c$; 
$a, c \to \text{not } b$; $b, c \to \text{not } a$. 

• Compatibility: $a, b, c \to \text{rarely}$; $a, b, c \to \text{frequently}$. 
This rule says that the values 
enumerated on the left side of the rule can 
simultaneously occur rarely (frequently). 
The rule of compatibility presents the most 
frequently observed combination of values 
that is different from a law or regularity, 
with only one or two exceptions. 
Compatibility is equivalent to a collection 
of assertions as follows: 

$a, b \to c$ rarely (frequently) 

$a, c \to b$ rarely (frequently) 

$b, c \to a$ rarely (frequently) 

• Diagnostic rule: $x, d \to a$; $x, b \to \text{not } a$; 
$d, b \to \mathit{false}$. For example, d and b can be two 
values of the same attribute. This rule works 
when the truth of “x” has been proven and it 
is necessary to determine whether “a” is true 
or not. If “x & d” is true, then “a” is true, 
but if “x & b” is true, then “a” is false. 

• Rule of alternatives: $a \text{ or } b \to \mathit{true}$ (always); 
$a, b \to \mathit{false}$. This rule says that a and b cannot 
be simultaneously true; either a or b can 
be true, but not both. This rule is a variant 
of interdiction. 
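
The transformation of an interdiction into implications, mentioned in the list above, can be sketched in a few lines of Python. This is only an illustration: the function name `interdiction_to_implications` and the encoding of a rule as a pair (premises, negated value) are assumptions made here, not notation from the chapter.

```python
def interdiction_to_implications(values):
    """Expand the interdiction 'values -> false' into implications.

    For every value v, the remaining values imply 'not v'.
    Each implication is returned as a pair (premises, negated_value).
    """
    values = list(values)
    rules = []
    for v in values:
        premises = frozenset(x for x in values if x != v)
        rules.append((premises, v))
    return rules


rules = interdiction_to_implications(["a", "b", "c"])
for premises, negated in rules:
    print(", ".join(sorted(premises)), "-> not", negated)
```

For the interdiction a, b, c → false, the sketch lists one implication per value, each concluding the negation of the value left out of the premises.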
Deductive Reasoning Rules of the 
Second Type 

Deductive steps of commonsense reasoning con- 
sist of inferring consequences from some observed 
facts with the use of statements of the form “if- 
then” (i.e., knowledge). For this goal, deductive 
rules of reasoning are applied, the main forms of 
which are modus ponens, modus tollens, modus 
ponendo tollens, and modus tollendo ponens. 

Let x be a collection of true values of some 
attributes (or evidences), observed simultane- 
ously. 

• Using implication: Let r be an implication, 
left(r) be the left part of r and right(r) be the 
right part of r. If $\mathrm{left}(r) \subseteq x$, then x can be 
extended by right(r): $x \leftarrow x \cup \mathrm{right}(r)$. Using 
implication is based on modus ponens: if A, 
then B; A; hence B. 

• Using interdiction: Let r be an implication 
$y \to \text{not } k$. If $\mathrm{left}(r) \subseteq x$, then k is a forbidden 
value for all the extensions of x. Using 
interdiction is based on modus ponendo 
tollens: either A or B (A, B - alternatives); 
A; hence not B; either A or B; B; hence not 
A. 

• Using compatibility: Let r = “$a, b, c \to k$, 
rarely (frequently),” where rarely, 
frequently are the values of a special attribute 
(SA). If $\mathrm{left}(r) \subseteq x$, then k can be used for an 
extension of x with the value of SA equal to 
“rarely” (“frequently”). The application of 
several rules of compatibility leads to the 
appearance of several values “rarely” and/or 
“frequently” in the extension of x. 

Computing the value of SA for the extension of 
x requires special consideration. In any case, the 
appearance of at least one value “rarely” means 
that the total result of the extension will have the 
value of SA equal to “rarely.” Two values equal to 
“frequently” lead to the result “less frequently,” 
three values equal to “frequently” lead to the 


result “less less frequently,” and hence the values 
“rarely” and “frequently” must have the ordering 
scale of measuring. 

Using compatibility is based on modus po- 
nens. 

• Using diagnostic rules: Let r be a diagnostic 
rule such as “$x, d \to a$; $x, b \to \text{not } a$,” where 
“x” is true, and “a,” “not a” are hypotheses 
or possible values of some attribute. Using 
a diagnostic rule is based on modus ponens 
and modus ponendo tollens. 

There are several ways for refuting one of 
the hypotheses: 

1. To infer either d or b with the use of 
one’s knowledge; 

2. To involve new known facts and/or 
statements for inferring (with the use 
of inductive reasoning rules of the 
second type) new rules of the first type 
for distinguishing the hypotheses “a” 
and “not a”; to apply these new rules; 

3. To determine, from an observation, which of 
the values d or b is true. 

• Using rule of alternatives: Let “a” and 
“b” be two alternative hypotheses about 
the value of some attribute. If one of these 
hypotheses is inferred with the help of 
reasoning operations, then the other one is 
rejected. Using a rule of alternatives is based 
on modus tollendo ponens: either A or B (A, 
B - alternatives); not A; hence B; either A or 
B; not B; hence A. 

The operations enumerated can be named as 
“forward reasoning” rules. 
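
The forward reasoning operations above can be illustrated by a small Python sketch that extends a collection x by repeated modus ponens and then collects the values forbidden by interdictions. The function name `forward_extend`, the encoding of rules as (left, right) pairs, and the sample rules are hypothetical; compatibility, diagnostic rules, and alternatives are left out to keep the sketch short.

```python
def forward_extend(x, implications, interdictions):
    """Extend the set x of true values by forward reasoning.

    implications : list of (left, right) pairs, read 'left -> right'
    interdictions: list of (left, k) pairs, read 'left -> not k'
    Returns the extended set and the set of forbidden values.
    """
    x = set(x)
    changed = True
    while changed:  # repeat modus ponens until nothing new is added
        changed = False
        for left, right in implications:
            if set(left) <= x and right not in x:
                x.add(right)
                changed = True
    # a value k is forbidden when the left part of 'left -> not k' holds
    forbidden = {k for left, k in interdictions if set(left) <= x}
    return x, forbidden


# Hypothetical rules of the first type, used only for illustration.
implications = [({"a", "b"}, "c"), ({"c"}, "d")]
interdictions = [({"d"}, "e")]
extended, forbidden = forward_extend({"a", "b"}, implications, interdictions)
print(sorted(extended), sorted(forbidden))
```

Starting from x = {a, b}, the first implication adds c, the second adds d, and the interdiction d → not e then marks e as forbidden for every further extension of x.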

Experts also use implicative assertions in a 
different way. This way can be named as “back- 
ward reasoning.” 

• Generating hypothesis or abduction rule: 

Let r be an implication $y \to k$. Then the following 
hypothesis is generated: “if k is true, 
then it is possible that y is true.” 
• Using modus tollens: Let r be an implication 
$y \to k$. If “not k” is inferred, then “not 
y” is also inferred. 
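
The two backward reasoning rules can be sketched in the same style. The helper names `abduce` and `modus_tollens`, the pair encoding of implications, and the sample rules are assumptions made for this illustration only.

```python
def abduce(implications, k):
    """Abduction: left sides y of rules 'y -> k' are possible explanations of k."""
    return [frozenset(left) for left, right in implications if right == k]


def modus_tollens(implications, false_values):
    """Modus tollens: if 'not k' holds, the left side y of 'y -> k' is refuted."""
    return [frozenset(left) for left, right in implications
            if right in false_values]


# Hypothetical implications y -> k.
implications = [({"a", "b"}, "k"), ({"c"}, "k"), ({"d"}, "m")]

hypotheses = abduce(implications, "k")        # what may explain k?
refuted = modus_tollens(implications, {"m"})  # what follows from 'not m'?
print(hypotheses, refuted)
```

Here the observation of k yields two candidate explanations, "a, b" and "c", which remain hypotheses rather than conclusions, while "not m" refutes the left side "d".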

Natural diagnostic reasoning is not a 
method of proving the truth. It has another goal: 
to infer all possible hypotheses about the value 
of some target attribute. These hypotheses must 
not contradict the expert’s knowledge and 
the situation under consideration. The process 
of inferring hypotheses is reduced to extending 
maximally a collection x of attribute values such 
that none of the forbidden pairs of values would 
belong to the extension of x. 

Inductive Reasoning Rules of the 
Second Type 

Inductive steps of common sense reasoning con- 
sist of using already known facts and statements, 
observations, and experience for inferring new 
logical rules of the first type or correcting those 
that turn out to be false. 

For this goal, inductive reasoning rules are ap- 
plied. The main forms of induction are the canons 
of induction that have been formulated by English 
logician Mill (1900). These canons are known as 
the five induction methods of reasoning: method 
of only similarity, method of only distinction, 
joint method of similarity-distinction, method of 
concomitant changes, and method of residuum. 

• The method of only similarity: This rule 
means that if the previous events (values) A, 
B, C lead to the events (values) a, b, c and 
the events (values) A, D, E lead to the events 
(values) a, d, e, then A is a plausible reason 
of a. 

• The method of only distinction: This rule 
means that if the previous events (values) 
A, B, C lead to (or give rise to) the events 
(values) a, b, c and the events (values) B, C 
lead to the events (values) b, c, then A is a 
plausible reason of a. 


• The joint method of similarity-distinction: 
This method consists of applying the two 
previous methods simultaneously. 

• The method of concomitant changes: This 
rule means that if the change of a previous 
event (value) A is accompanied by the change 
of an event (value) a, and all the other 
previous events (values) do not change, then A is 
a plausible reason of a. 

• The method of residuum: Let U be a complex 
phenomenon abcd, and we know that 
A is the reason of a, B is the reason of b, 
and C is the reason of c. Then it is possible 
to suppose that there is an event D that is a 
reason of d. 
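
The first two canons can be sketched as set operations over observations, each observation being a pair (previous events, resulting events). The function names and this encoding are assumptions of the sketch; in particular, the method of distinction is simplified here to pairs of observations whose antecedents differ in exactly one event.

```python
def method_of_similarity(observations, effect):
    """Previous events common to all observations in which `effect` occurs."""
    with_effect = [set(prev) for prev, res in observations if effect in res]
    if not with_effect:
        return set()
    return set.intersection(*with_effect)


def method_of_distinction(observations, effect):
    """Events whose removal from the antecedents removes `effect`."""
    candidates = set()
    for prev1, res1 in observations:
        if effect not in res1:
            continue
        for prev2, res2 in observations:
            if effect in res2:
                continue
            diff = set(prev1) - set(prev2)
            # the two observations must differ in a single previous event
            if set(prev2) <= set(prev1) and len(diff) == 1:
                candidates |= diff
    return candidates


similar = method_of_similarity(
    [({"A", "B", "C"}, {"a", "b", "c"}), ({"A", "D", "E"}, {"a", "d", "e"})],
    "a")
distinct = method_of_distinction(
    [({"A", "B", "C"}, {"a", "b", "c"}), ({"B", "C"}, {"b", "c"})],
    "a")
print(similar, distinct)
```

Both calls use the examples given in the list above and point to A as a plausible reason of a.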


THE CONCEPT OF A GOOD 
CLASSIFICATION TEST 

Our approach to machine-learning problems is 
based on the concept of a good diagnostic (clas- 
sification) test. A good classification test can be 
understood as an approximation of a given clas- 
sification on a given set of examples (Naidenova, 
1996; Naidenova & Polegaeva, 1986). 

A good diagnostic test is defined as follows. 
Let R be a set of examples and $S = \{1, 2, \ldots, i, \ldots, n\}$ 
be the set of indices of examples, where n is the 
number of examples of R. Let R(+) and S(+) be the 
set of positive examples and the set of indices of 
positive examples, respectively. Let $R(-) = R \setminus R(+)$ 
denote the set of negative examples. Let U be 
the set of attributes and T be the set of attribute 
values (values, for short), each of which appears 
in at least one of the examples of R. 

Denote by $s(A)$, $A \in T$, the subset 
$\{i \in S : A \text{ appears in } t_i,\ t_i \in R\}$, where $S = \{1, 2, \ldots, n\}$. 

Following Cosmadakis et al. (1986), we call 
s(A) the interpretation of $A \in T$ in R. The 
definition of s(A) can be extended to the definition of 
s(t) for any collection t, $t \subseteq T$, of values as follows: 
if $t = A_1 A_2 \ldots A_m$, then 
$s(t) = s(A_1) \cap s(A_2) \cap \ldots \cap s(A_m)$. 
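
As a minimal sketch, the interpretation s(t) can be computed directly from this definition. The data are the examples of Table 1; the dictionary layout and the function name `s` are choices made for this illustration.

```python
# Examples of Table 1: index of example -> set of attribute values.
EXAMPLES = {
    1: {"Low", "Blond", "Bleu"},
    2: {"Low", "Brown", "Bleu"},
    3: {"Tall", "Brown", "Embrown"},
    4: {"Tall", "Blond", "Embrown"},
    5: {"Tall", "Brown", "Bleu"},
    6: {"Low", "Blond", "Embrown"},
    7: {"Tall", "Red", "Bleu"},
    8: {"Tall", "Blond", "Bleu"},
}


def s(t):
    """Interpretation s(t): indices of the examples containing every value of t."""
    return {i for i, example in EXAMPLES.items() if set(t) <= example}


blond = s({"Blond"})
bleu = s({"Bleu"})
both = s({"Blond", "Bleu"})
print(blond, bleu, both)
```

The interpretation of the collection "Blond Bleu" is the intersection of the interpretations of its values, as the definition requires.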
Definition 1. A collection $t \subseteq T$ ($s(t) \neq \emptyset$) of values 
is a diagnostic test for the set R(+) of examples if 
and only if the following condition is satisfied: 
$t \not\subseteq t^*$, $\forall t^*$, $t^* \in R(-)$ (the equivalent condition is 
$s(t) \subseteq S(+)$). 

Let k be the name of a set R(k) of examples. To 
say that a collection t of values is a diagnostic test 
for R(k) is equivalent to saying that it does not cover 
any example $t^*$, $t^* \notin R(k)$. At the same time, the 
condition $s(t) \subseteq S(k)$ implies that the following 
implicative dependency is true: “if t, then k.” Thus 
a diagnostic test, as a collection of values, makes 
up the left side of a rule of the first type. 

It is clear that the set of all diagnostic tests for 
a given set R(+) of examples (call it “DT(+)”) is 
the set of all the collections t of values for which 
the condition $s(t) \subseteq S(+)$ is true. For any pair of 
diagnostic tests $t_i$, $t_j$ from DT(+), only one of the 
following relations is true: $s(t_i) \subseteq s(t_j)$, $s(t_i) \supseteq s(t_j)$, 
$s(t_i) \approx s(t_j)$, where the last one means that $s(t_i)$ and 
$s(t_j)$ are incomparable, that is, $s(t_i) \not\subseteq s(t_j)$ and 
$s(t_j) \not\subseteq s(t_i)$. This consideration leads to the concept of 
a good diagnostic test. 

Definition 2. A collection $t \subseteq T$ ($s(t) \neq \emptyset$) of 
values is a good test for the set R(+) of examples 
if and only if $s(t) \subseteq S(+)$ and, simultaneously, the 
condition $s(t) \subset s(t^*) \subseteq S(+)$ is not satisfied for 
any $t^*$, $t^* \subseteq T$, such that $t^* \neq t$. 

Now we shall give the following definitions. 

Definition 3. A collection t of values is irredundant 
if for any value $v \in t$ the following condition is 
satisfied: $s(t) \subset s(t \setminus v)$. 

If a collection t of values is a good test for 
R(+) and, simultaneously, it is an irredundant 
collection of values, then no proper subset of t 
is a test for R(+). 

Definition 4. A collection $t \subseteq T$ of values is 
maximally redundant if for any implicative 
dependency $X \to v$, which is satisfied in R, the fact 
that t contains X implies that t also contains v. 

If t is a maximally redundant collection of 
values, then for any value $v \notin t$, $v \in T$, the 
following condition is satisfied: $s(t) \supset s(t \cup v)$. In other 
words, a maximally redundant collection t of 
values covers a greater number of examples than 
any collection $(t \cup v)$ of values, where $v \notin t$. 

If a diagnostic test t for a given set R(+) of 
examples is a good one and it is a maximally 
redundant collection of values, then for any value 
$v \notin t$, $v \in T$, the following condition is satisfied: 
$(t \cup v)$ is not a good test for R(+). 

Any example t in R is a maximally redundant 
collection of values because for any value $v \notin t$, 
$v \in T$, $s(t \cup v)$ is equal to $\emptyset$. 

For example, in Table 1, the collection “Blond 
Bleu” is a good irredundant test for class 1 and, 
simultaneously, it is a maximally redundant 
collection of values. The collection “Blond Embrown” 
is a test for class 2, but it is not good and, 
simultaneously, it is a maximally redundant collection 
of values. 

The collection “Embrown” is a good irredun- 
dant test for class 2. The collection “Red” is a 
good irredundant test for class 1. The collection 
“Tall Red Bleu” is a good maximally redundant 
test for class 1. 
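
The statements about Table 1 can be checked mechanically. The following Python sketch applies Definitions 1 to 4 literally, enumerating all collections of values by brute force, which is feasible only for a table of this size. All function names are illustrative assumptions, and the maximal redundancy check uses the condition s(t) ⊃ s(t ∪ v) for every value v outside t.

```python
from itertools import combinations

# Examples of Table 1: index -> set of values, and index -> class.
EXAMPLES = {
    1: {"Low", "Blond", "Bleu"}, 2: {"Low", "Brown", "Bleu"},
    3: {"Tall", "Brown", "Embrown"}, 4: {"Tall", "Blond", "Embrown"},
    5: {"Tall", "Brown", "Bleu"}, 6: {"Low", "Blond", "Embrown"},
    7: {"Tall", "Red", "Bleu"}, 8: {"Tall", "Blond", "Bleu"},
}
CLASS_OF = {1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 1, 8: 1}
VALUES = sorted(set().union(*EXAMPLES.values()))


def s(t):
    """Interpretation of a collection t of values."""
    return {i for i, ex in EXAMPLES.items() if set(t) <= ex}


def positives(k):
    """Indices S(+) of the examples of class k."""
    return {i for i, c in CLASS_OF.items() if c == k}


def is_test(t, k):
    """Definition 1: s(t) is nonempty and contained in S(+)."""
    cover = s(t)
    return bool(cover) and cover <= positives(k)


def is_good_test(t, k):
    """Definition 2: no collection t* satisfies s(t) < s(t*) <= S(+)."""
    if not is_test(t, k):
        return False
    cover = s(t)
    for size in range(len(VALUES) + 1):
        for other in combinations(VALUES, size):
            if cover < s(other) <= positives(k):
                return False
    return True


def is_irredundant(t):
    """Definition 3: removing any value strictly enlarges s(t)."""
    t = set(t)
    return all(s(t) < s(t - {v}) for v in t)


def is_maximally_redundant(t):
    """Adding any value outside t strictly reduces s(t)."""
    t = set(t)
    return all(s(t | {v}) < s(t) for v in VALUES if v not in t)


print(is_good_test({"Blond", "Bleu"}, 1), is_good_test({"Blond", "Embrown"}, 2))
```

On the data of Table 1, these checks agree with the classification of the collections "Blond Bleu", "Blond Embrown", "Embrown", "Red", and "Tall Red Bleu" given in the text.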

It is clear that the best tests for pattern rec- 
ognition problems must be good irredundant 
tests. These tests allow constructing the shortest 
rules of the first type with the highest degree of 
generalization. 

One of the possible ways for searching for 
good irredundant tests for a given class of posi- 
tive examples is the following: first, find all good 
maximally redundant tests; second, for each good 
maximally redundant test, find all good irredun- 
dant tests contained in it. This is a convenient 
strategy as each good irredundant test belongs to 
one and only one good maximally redundant test 
with the same interpretation (Naidenova, 1999). 
Table 1. Example 1 of data classification (this example is adopted from Ganascia, 1989) 

Index of example   Height   Color of hair   Color of eyes   Class 
1                  Low      Blond           Bleu            1 
2                  Low      Brown           Bleu            2 
3                  Tall     Brown           Embrown         2 
4                  Tall     Blond           Embrown         2 
5                  Tall     Brown           Bleu            2 
6                  Low      Blond           Embrown         2 
7                  Tall     Red             Bleu            1 
8                  Tall     Blond           Bleu            1 

Note to Table 1 and all the following tables: the values of attributes must not be considered as words of the English language; 
they are abstract symbols only. 


THE DUALITY OF GOOD 
DIAGNOSTIC TESTS 

In our definition of good tests, we used, implicitly, 
the Galois correspondences G on $S \times T$ and two 
relations $S \to T$, $T \to S$ (Ore, 1944; Riguet, 1948). Let 
$s \subseteq S$, $t \subseteq T$. We define the relations as follows: 
$S \to T$: $t(s) = \{\text{intersection of all } t_i : t_i \subseteq T,\ i \in s\}$ 
and $T \to S$: $s(t) = \{i : i \in S,\ t \subseteq t_i\}$. 

Extending s by an index j* of some new 
example leads to receiving a more general feature 
of examples: 

$(s \cup j^*) \supseteq s$ implies $t(s \cup j^*) \subseteq t(s)$. 

Extending t by a new value A leads to 
decreasing the number of examples possessing the general 
feature “tA” in comparison with the number of 
examples possessing the general feature “t”: 

$(t \cup A) \supseteq t$ implies $s(t \cup A) \subseteq s(t)$. 

Now we shall introduce the following gener- 
alization operations (functions): 

generalization_of(t) = t' = t(s(t)); 
generalization_of(s) = s' = s(t(s)). 

As a result of the generalization of s, the 
sequence of operations $s \to t(s) \to s(t(s))$ gives 
that $s(t(s)) \supseteq s$. This generalization operation 
gives the maximal set of examples possessing 
the feature t(s). 

As a result of the generalization of t, the 
sequence of operations $t \to s(t) \to t(s(t))$ gives that 
$t(s(t)) \supseteq t$. This generalization operation gives the 
maximal general feature for examples the indices 
of which are in s(t). 
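
A minimal Python sketch of the two mappings and of the generalization operations, again on the data of Table 1. The names `t_of`, `s_of`, `generalization_of_s`, and `generalization_of_t` are chosen for this illustration, and the convention that t of the empty set of indices equals the whole set T is an assumption of the sketch.

```python
# Examples of Table 1: index of example -> set of attribute values.
EXAMPLES = {
    1: {"Low", "Blond", "Bleu"}, 2: {"Low", "Brown", "Bleu"},
    3: {"Tall", "Brown", "Embrown"}, 4: {"Tall", "Blond", "Embrown"},
    5: {"Tall", "Brown", "Bleu"}, 6: {"Low", "Blond", "Embrown"},
    7: {"Tall", "Red", "Bleu"}, 8: {"Tall", "Blond", "Bleu"},
}
ALL_VALUES = set().union(*EXAMPLES.values())


def t_of(s):
    """S -> T: values common to all examples whose indices are in s."""
    rows = [EXAMPLES[i] for i in s]
    # assumed convention: the empty family of examples gives the whole set T
    return set.intersection(*rows) if rows else set(ALL_VALUES)


def s_of(t):
    """T -> S: indices of all examples containing every value of t."""
    return {i for i, ex in EXAMPLES.items() if set(t) <= ex}


def generalization_of_s(s):
    """s -> t(s) -> s(t(s)); the result always contains s."""
    return s_of(t_of(s))


def generalization_of_t(t):
    """t -> s(t) -> t(s(t)); the result always contains t."""
    return t_of(s_of(t))


gen_s = generalization_of_s({2, 3})
gen_t = generalization_of_t({"Red"})
print(gen_s, gen_t)
```

Examples 2 and 3 share only the value "Brown", which also occurs in example 5, so the generalization of {2, 3} is {2, 3, 5}. The value "Red" occurs only in example 7, so its generalization is the whole description "Tall Red Bleu".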

These generalization operations are not artifi- 
cially constructed operations. One can perform, 
mentally, a lot of such operations during a short 
period of time. We give some examples of these 
operations. Suppose that somebody has seen 
two films (s) with the participation of Gerard 
Depardieu (t(s)). After that he tries to know all 
the films with his participation (s(t(s))). One can 
know that Gerard Depardieu acts with Pierre 
Richard (t) in several films (s(t)). After that he 
can discover that these films are the films of the 
same producer, Francis Veber (t(s(t))). 

Namely, these generalization operations are 
used for searching for good diagnostic tests. 

Now we define a diagnostic test as a dual 
object, that is, as a pair (SL, TA), $SL \subseteq S$, $TA \subseteq T$, 
$SL = s(TA)$ and $TA = t(SL)$. 

Definition 5. Let $PM = \{s_1, s_2, \ldots, s_m\}$ be a family 
of subsets of some set M. Then PM is a Sperner 
system (Sperner, 1928) if the following condition 
is satisfied: $s_i \not\subseteq s_j$ and $s_j \not\subseteq s_i$, $\forall (i, j)$, $i \neq j$, 
$i, j = 1, \ldots, m$. 

Let R, S, R(+), R(-), S(+) be defined as be- 
fore. 

Definition 6. To find all good maximally redundant 
tests (GMRTs) for a given class R(+) of examples 
means to construct a family PS of subsets 
$s_1, s_2, \ldots, s_j, \ldots, s_{np}$ of the set S(+) such that: 

1. PS is a Sperner system. 

2. Each $s_j$ is a maximal set in the sense that 
adding to it the index i, such that $i \notin s_j$, 
$i \in S(+)$, implies $s(t(s_j \cup i)) \not\subseteq S(+)$. Putting it 
in another way, $t(s_j \cup i)$ is not a test for the 
class R(+). 

The set TGOOD of all GMRTs is determined 
as follows: $\{t : t(s_j),\ s_j \in PS,\ \forall j = 1, \ldots, np\}$. 
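
Definition 6 can be illustrated by a brute-force Python sketch that enumerates all subsets of S(+), keeps the generalizations s(t(s)) that stay inside S(+), and retains only the maximal ones. This exhaustive search is an assumption made for illustration on the small Table 1 and is not one of the algorithms described in this chapter; the function names are likewise hypothetical.

```python
from itertools import combinations

# Examples of Table 1: index -> set of values, and index -> class.
EXAMPLES = {
    1: {"Low", "Blond", "Bleu"}, 2: {"Low", "Brown", "Bleu"},
    3: {"Tall", "Brown", "Embrown"}, 4: {"Tall", "Blond", "Embrown"},
    5: {"Tall", "Brown", "Bleu"}, 6: {"Low", "Blond", "Embrown"},
    7: {"Tall", "Red", "Bleu"}, 8: {"Tall", "Blond", "Bleu"},
}
CLASS_OF = {1: 1, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 1, 8: 1}


def t_of(s):
    """Values common to all examples with indices in a nonempty s."""
    return set.intersection(*(EXAMPLES[i] for i in s))


def s_of(t):
    """Indices of all examples containing every value of t."""
    return {i for i, ex in EXAMPLES.items() if set(t) <= ex}


def good_maximally_redundant_tests(k):
    """Map each GMRT of class k (a frozenset of values) to its interpretation."""
    pos = {i for i, c in CLASS_OF.items() if c == k}
    closed = set()
    for size in range(1, len(pos) + 1):
        for subset in combinations(sorted(pos), size):
            closure = frozenset(s_of(t_of(subset)))
            if closure <= pos:  # t(subset) is a test for class k
                closed.add(closure)
    # keep only the maximal interpretations: they form a Sperner system
    maximal = [c for c in closed if not any(c < d for d in closed)]
    return {frozenset(t_of(c)): set(c) for c in maximal}


gmrt_1 = good_maximally_redundant_tests(1)
gmrt_2 = good_maximally_redundant_tests(2)
print(gmrt_1)
print(gmrt_2)
```

For class 1 the sketch returns "Blond Bleu" with interpretation {1, 8} and "Tall Red Bleu" with interpretation {7}; for class 2 it returns "Brown" and "Embrown".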

Definition 7. To find all good irredundant tests 
(GIRTs) for a given class R(+) of examples means 
to construct a family PRT of subsets 
$t_1, t_2, \ldots, t_j, \ldots, t_{nq}$ of the set T such that: 

1. $\forall t_j$, $j = 1, \ldots, nq$, $t_j \not\subseteq t$, $\forall t$, $t \in R(-)$ and, 
simultaneously, $\forall t_j$, $j = 1, \ldots, nq$, $s(t_j) \neq \emptyset$, 
there does not exist a collection $s^* \neq s(t_j)$, 
$s^* \subseteq S$, of indices such that $s(t_j) \subset s^* \subseteq S(+)$. 

2. PRT is a Sperner system. 

3. Each $t_j$ is a minimal set in the sense that 
removing from it any value A belonging to 
it implies $s(t_j \text{ without } A) \not\subseteq S(+)$. 
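
Since each good irredundant test is contained in a good maximally redundant test with the same interpretation, the GIRTs inside a given GMRT can be sketched as the minimal subsets of the GMRT that keep its interpretation. The Python fragment below does this by exhaustive enumeration on the data of Table 1; the function name `good_irredundant_tests_in` is hypothetical, and the input is assumed to be a GMRT already.

```python
from itertools import combinations

# Examples of Table 1: index of example -> set of attribute values.
EXAMPLES = {
    1: {"Low", "Blond", "Bleu"}, 2: {"Low", "Brown", "Bleu"},
    3: {"Tall", "Brown", "Embrown"}, 4: {"Tall", "Blond", "Embrown"},
    5: {"Tall", "Brown", "Bleu"}, 6: {"Low", "Blond", "Embrown"},
    7: {"Tall", "Red", "Bleu"}, 8: {"Tall", "Blond", "Bleu"},
}


def s_of(t):
    """Indices of all examples containing every value of t."""
    return {i for i, ex in EXAMPLES.items() if set(t) <= ex}


def good_irredundant_tests_in(gmrt):
    """Minimal subsets of a GMRT having the same interpretation as the GMRT."""
    target = s_of(gmrt)
    found = []
    # subsets are visited by increasing size, so minimal ones are met first
    for size in range(1, len(gmrt) + 1):
        for subset in combinations(sorted(gmrt), size):
            u = set(subset)
            if s_of(u) == target and not any(f < u for f in found):
                found.append(u)
    return found


girt_a = good_irredundant_tests_in({"Tall", "Red", "Bleu"})
girt_b = good_irredundant_tests_in({"Blond", "Bleu"})
print(girt_a, girt_b)
```

For the GMRT "Tall Red Bleu" the only GIRT found is "Red", while "Blond Bleu" is itself irredundant, which matches the examples discussed for Table 1.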

GENERATION OF DUAL OBJECTS 
WITH THE USE OF LATTICE 
OPERATIONS 

Let R be a table of examples and S, T be defined 
as before. Let MUT be the set of all dual objects, 
that is, the set of all pairs (s, t), $s \subseteq S$, $t \subseteq T$, 
$s = s(t)$, and $t = t(s)$. This set is partially ordered by 
the relation $\le$, where $(s, t) \le (s^*, t^*)$ is satisfied if 
and only if $s \subseteq s^*$ and $t \supseteq t^*$. 

The set $\Psi = (MUT, \cup, \cap)$ is an algebraic lattice, where operations $\cup$, $\cap$ are defined for all pairs $(s^*, t^*)$, $(s, t) \in MUT$ in the following way (Wille, 1992):

$$(s^*, t^*) \cup (s, t) = ((s^* \cup s), (t^* \cap t)),$$

$$(s^*, t^*) \cap (s, t) = ((s^* \cap s), (t^* \cup t)).$$

The unit element and the zero element are $(S, \emptyset)$ and $(\emptyset, T)$, respectively.
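The ordering and the two operations can be transcribed directly. The following Python sketch is illustrative only: the names `leq`, `join`, and `meet` and the representation of a dual object as a pair of frozensets are assumptions, and the functions apply the component-wise formulas exactly as they are stated here.

```python
def leq(a, b):
    """(s, t) <= (s*, t*) iff s is a subset of s* and t is a superset of t*."""
    return a[0] <= b[0] and a[1] >= b[1]


def join(a, b):
    """Union of the index parts, intersection of the value parts."""
    return (a[0] | b[0], a[1] & b[1])


def meet(a, b):
    """Intersection of the index parts, union of the value parts."""
    return (a[0] & b[0], a[1] | b[1])


# Two hypothetical dual objects: (set of example indices, set of values).
x = (frozenset({1, 2}), frozenset({"Low", "Bleu"}))
y = (frozenset({2}), frozenset({"Low", "Brown", "Bleu"}))

print(leq(y, x))         # y lies below x in the ordering
print(join(x, y) == x)   # so their join is x
print(meet(x, y) == y)   # and their meet is y
```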

Inferring good tests is reduced to inferring, for any element $(s^*, t^*) \in MUT$, all the elements nearest to it in the lattice with respect to the ordering $\leq$, that is, inferring all $(s, t)$ such that $(s^*, t^*) \leq (s, t)$ and there does not exist any $(s^{**}, t^{**})$ such that $(s^*, t^*) \leq (s^{**}, t^{**}) \leq (s, t)$, or inferring all $(s, t)$ such that $(s^*, t^*) \geq (s, t)$ and there does not exist any $(s^{**}, t^{**})$ such that $(s^*, t^*) \geq (s^{**}, t^{**}) \geq (s, t)$.

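Inferring the chains of lattice elements ordered by the inclusion relation lies in the foundation of generating all types of diagnostic tests:

$$s_0 \subseteq \ldots \subseteq s_i \subseteq s_{i+1} \subseteq \ldots \subseteq s_m \quad (t(s_0) \supseteq t(s_1) \supseteq \ldots \supseteq t(s_i) \supseteq t(s_{i+1}) \supseteq \ldots \supseteq t(s_m)), \qquad (1)$$

$$t_0 \subseteq \ldots \subseteq t_i \subseteq t_{i+1} \subseteq \ldots \subseteq t_m \quad (s(t_0) \supseteq s(t_1) \supseteq \ldots \supseteq s(t_i) \supseteq s(t_{i+1}) \supseteq \ldots \supseteq s(t_m)). \qquad (2)$$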

Inductive Rules for Constructing 
Elements of a Dual Lattice 

We use the following variants of inductive transition from one element of a chain to its nearest element in the lattice:

1. from $s_q = (i_1, i_2, \ldots, i_q)$ to $s_{q+1} = (i_1, i_2, \ldots, i_{q+1})$;

2. from $t_q = (A_1, A_2, \ldots, A_q)$ to $t_{q+1} = (A_1, A_2, \ldots, A_{q+1})$;

3. from $s_q = (i_1, i_2, \ldots, i_q)$ to $s_{q-1} = (i_1, i_2, \ldots, i_{q-1})$;

4. from $t_q = (A_1, A_2, \ldots, A_q)$ to $t_{q-1} = (A_1, A_2, \ldots, A_{q-1})$.

We need the special rules for realizing these inductive transitions.

The Generalization Rule 

The generalization rule is used to get all the collections of indices $s_{q+1} = \{i_1, i_2, \ldots, i_q, i_{q+1}\}$ from a collection $s_q = \{i_1, i_2, \ldots, i_q\}$ such that $t(s_q)$ and $t(s_{q+1})$ are tests for a given class of positive examples.

The termination condition for constructing a chain of generalizations is: for all the extensions $s_{q+1}$ of $s_q$, $t(s_{q+1})$ is not a test for a given class of positive examples.

Consider some of the possible realizations of 
this rule for inferring GMRTs. 

The first variant of generalization rule. Let S(test) be the partially ordered set of elements $s = \{i_1, i_2, \ldots, i_q\}$, $q = 1, 2, \ldots, nt - 1$, obtained as a result of generalizations and satisfying the following condition: $t(s)$ is a test for $R(+)$. Here $nt$ denotes the number of positive examples. Let STGOOD be the partially ordered set of elements $s$ satisfying the following condition: $t(s)$ is a GMRT for $R(+)$.

Next we use an inductive rule for extending elements of S(test) and constructing $\{i_1, i_2, \ldots, i_{q+1}\}$ from $\{i_1, i_2, \ldots, i_q\}$, $q = 1, 2, \ldots, nt - 1$. This rule relies on the following consideration: if the set $\{i_1, i_2, \ldots, i_{q+1}\}$ corresponds to a test for $R(+)$, then all its proper subsets must correspond to tests too and, consequently, they must be in S(test). Having constructed a set $s_{q+1} = \{i_1, i_2, \ldots, i_{q+1}\}$, we determine whether it corresponds to a test or not. If $t(s_{q+1})$ is not a test, then $s_{q+1}$ is deleted; otherwise it is inserted in S(test). If all the extensions of $s_q$ do not correspond to tests, then $s_q$ corresponds to a GMRT and it is inserted in STGOOD.

The function to_be_test(t) is defined as follows: if $s(t) \cap S(+) = s(t)$ then true else false.
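A minimal Python sketch of this function follows. The encoding of examples as sets of values keyed by index and the helper names `s_of` and `t_of` (standing for the maps $s(t)$ and $t(s)$) are assumptions made for illustration; the data are the eight examples of Table 4 of this chapter, with class 1 taken as the positive class.

```python
# Hypothetical encoding of Table 4: index -> values of the example.
ROWS = {
    1: "Low Blond Bleu", 2: "Low Brown Bleu", 3: "Tall Brown Embrown",
    4: "Tall Blond Embrown", 5: "Tall Brown Bleu", 6: "Low Blond Embrown",
    7: "Tall Red Bleu", 8: "Tall Blond Bleu",
}
EXAMPLES = {i: frozenset(row.split()) for i, row in ROWS.items()}
S_PLUS = frozenset({1, 2, 3})  # indices of the positive examples (class 1)


def s_of(t):
    """s(t): indices of all examples that contain every value of t."""
    t = frozenset(t)
    return frozenset(i for i, ex in EXAMPLES.items() if t <= ex)


def t_of(s):
    """t(s): values common to all examples whose indices are in s."""
    rows = [EXAMPLES[i] for i in s]
    return frozenset.intersection(*rows) if rows else frozenset()


def to_be_test(t):
    """True if s(t) & S(+) == s(t), i.e. t occurs in positive examples only."""
    covered = s_of(t)
    return covered & S_PLUS == covered


print(to_be_test({"Low", "Bleu"}))  # covers examples 1 and 2 only
print(to_be_test({"Low"}))          # also covers the negative example 6
```

Note that, taken literally, the condition also holds for a collection of values that occurs in no example at all; a caller may want to exclude that case separately.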

This variant of the generalization rule is used in an algorithm for inferring GMRTs given in Naidenova and Polegaeva (1991). An analogous inductive extension of collections of items is also used in two algorithms, Apriori and AprioriTid, proposed in Agrawal and Srikant (1994) for mining association rules between items in a large database of sales transactions.

The second variant of generalization rule. 
This rule allows for each element s the follow- 
ing: 

• To avoid constructing the set of all its sub- 
sets. 

• To avoid the repetitive generation of it. 

Consider a way for choosing indices admissible for extending $s_q$.

Suppose that S(test) and STGOOD are not empty and $s \in S(test)$. Construct the set $V$:

$$V = \{\cup s', s \subseteq s', s' \in \{S(test) \cup STGOOD\}\}.$$

The set $V$ is the union of all the collections of indices in S(test) and STGOOD containing $s$; hence, $s$ is in the intersection of these collections. If we want an extension of $s$ not to be included in any element of $\{S(test) \cup STGOOD\}$, we must use, for extending $s$, the indices not appearing simultaneously with $s$ in the set $V$. The set of indices, candidates for extending $s$, is the set:

$$CAND(s) = nts \setminus V, \text{ where } nts = \{\cup s, s \in S(test)\}.$$

An index $j^* \in CAND(s)$ is not admissible for extending $s$ if, at least for one index $i \in s$, the pair $\{i, j^*\}$ either does not correspond to a test or it corresponds to a good test (it belongs to STGOOD). Let $Q$ be the set of forbidden pairs of indices for extending $s$: $Q = \{\{i, j\} \subseteq S(+): t(\{i, j\}) \text{ is not a test for } R(+)\}$. Then the set of admissible indices is $select(s) = \{i, i \in CAND(s): (\forall j)(j \in s), \{i, j\} \notin \{STGOOD \text{ or } Q\}\}$.

The set Q can be generated in the beginning 
of searching all GMRTs for R(+). 
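The construction of CAND(s) and select(s) can be sketched as follows. This is an illustrative Python reading of the definitions, not the authors' implementation: S(test), STGOOD, and Q are assumed to be given as Python sets of frozensets, and the function names `cand` and `select` simply mirror the notation.

```python
def cand(s, s_test, st_good):
    """CAND(s) = nts minus V."""
    s = frozenset(s)
    # V: union of all stored collections that contain s.
    v = frozenset().union(*[c for c in s_test | st_good if s <= c])
    # nts: union of all collections in S(test).
    nts = frozenset().union(*s_test)
    return nts - v


def select(s, s_test, st_good, q):
    """Indices of CAND(s) that form no forbidden or already good pair with s."""
    s = frozenset(s)
    return frozenset(
        i for i in cand(s, s_test, st_good)
        if all(frozenset({i, j}) not in st_good and frozenset({i, j}) not in q
               for j in s)
    )


# Hypothetical state of the search.
s_test = {frozenset({1}), frozenset({1, 2}), frozenset({3}), frozenset({4})}
st_good = set()
q = {frozenset({1, 4})}  # t({1, 4}) is assumed not to be a test

print(sorted(cand({1}, s_test, st_good)))       # indices not yet seen with 1
print(sorted(select({1}, s_test, st_good, q)))  # the pair {1, 4} is forbidden
```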

The procedure EXTENSION(s) takes select(s) and returns the set ext(s) of all possible extensions of $s$ in the form $snew = (s \cup j)$, $j \in select(s)$, where snew corresponds to a test for $R(+)$. This procedure executes the function generalization_of(snew) for each element $snew \in ext(s)$ (for this function, please see the introduction to Section 4).

If ext(s) and the set $V$ are empty, then $s$ corresponds to a GMRT for $R(+)$ and $s$ is transferred from S(test) to STGOOD. If ext(s) contains one and only one element, then this element corresponds to a GMRT, it is inserted in STGOOD, and $s$ is deleted from S(test). In all other cases, the set ext(s) substitutes $s$ in S(test).

This variant of generalization rule is a complex 
process in which both deductive and inductive 
reasoning rules of the second type are performed 
(please, see Table 2). The knowledge acquired 
during the process of generalization (the sets Q, 
S(test), STGOOD ) is used for pruning the search 
in the domain space. 

The generalization rule realizes the joint 
method of similarity-distinction. The extending 
of s results in obtaining the subsets of positive 
examples of more and more power with more 
and more generalized features (set of values). 
This operation is analogous to the generalization 
rule applied for star generation under conceptual 
clustering (Michalski, 1983). 

An algorithm, NIAGaRa, based on this vari- 
ant of generalization rule is used in Naidenova 
(2001), for inferring GMRTs. 

The Specification Rule 

The specification rule is used to get all the collections of values $t_{q+1} = (A_1, A_2, \ldots, A_{q+1})$ from a collection $t_q = (A_1, A_2, \ldots, A_q)$ such that $t_q$ and $t_{q+1}$ are irredundant collections of values, and they are not tests for a given set of positive examples.

The termination condition for constructing a chain of specifications is: for all the extensions $t_{q+1}$ of $t_q$, $t_{q+1}$ is either a redundant collection of values or a test for a given set of positive examples.

This rule is used for inferring GIRTs. 

The first variant of specification rule. Let TGOOD be the partially ordered set of elements $t$ satisfying the following condition: $t$ is a good irredundant test for $R(+)$. We denote by SAFE the set of elements $t$ such that $t$ is an irredundant collection of values but not a test for $R(+)$.

Next we use an inductive rule for extending elements of SAFE and constructing $t_{q+1} = (A_1, A_2, \ldots, A_{q+1})$ from $t_q = (A_1, A_2, \ldots, A_q)$, $q = 1, 2, \ldots, na - 1$, where $na$ is the number of values in the set $T$. This rule relies on the following consideration: if the collection of values $\{A_1, A_2, \ldots, A_{q+1}\}$ is an irredundant one, then all its proper subsets must be irredundant collections of values too and, consequently, they must be in SAFE. Having constructed a set $t_{q+1} = \{A_1, A_2, \ldots, A_{q+1}\}$, we determine whether it is an irredundant collection of values or not. If the collection $t_{q+1}$ is redundant, then it is deleted from SAFE. If it is a test for $R(+)$, then it is transferred from SAFE to TGOOD. If $t_{q+1}$ is irredundant but not a test for $R(+)$, then it is a candidate for extension.


Table 2. Using deductive and inductive rules of the second type

| Inductive rules | Process | Deductive and inductive rules of the second type |
|---|---|---|
| Generalization rule | Forming Q | Generating forbidden rules |
| | Forming CAND(s) | The joint method of similarity-distinction |
| | Forming select(s) | Using forbidden rules |
| | Forming ext(s) | The method of only similarity |
| | Function to_be_test(t) | Using implication |
| | Generalization_of(snew) | Lattice operations |

We use the function to_be_irredundant(t) = if $(\forall A_i)(A_i \in t)$ $s(t) \neq s(t \setminus A_i)$ then true else false.

It is easy to see that this variant of specifica- 
tion rule is algorithmically equivalent to the first 
variant of the generalization rule. This rule is used 
in an algorithm given in Megretskaya (1988) for 
inferring GIRTs. 

The second variant of specification rule. It is an inductive extension rule containing a method for choosing admissible values for extending $t$ in the case where $t$ is not a test but its extension is a test for $R(+)$. We extend $t$ by choosing admissible values as follows: these values appear simultaneously with $t$ in the examples of $R(+)$, and do not appear with $t$ in any example of $R(-)$. These values are said to be essential. To get them, we construct two sets $V(-)$ and $V(+)$ as follows: $V(-) = \{\cup t': t \subseteq t', t' \in R(-)\}$; $V(+) = \{\cup t': t \subseteq t', t' \in R(+)\}$. The set ess(t) of essential values for $t$ is equal to $V(+) \setminus V(-)$. Thus searching for essential values requires a special reasoning operation, a diagnostic induction reasoning rule.
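The sets $V(+)$, $V(-)$ and ess(t) can be computed in a few lines. The Python sketch below is illustrative: the function name `essential_values` and the list-of-sets representation are assumptions, and the data are the examples of Table 4 of this chapter with class 1 as the positive class.

```python
def essential_values(t, positives, negatives):
    """ess(t) = V(+) minus V(-) for a collection of values t."""
    t = frozenset(t)
    # V(+): union of the positive examples that contain t.
    v_plus = frozenset().union(*[ex for ex in positives if t <= ex])
    # V(-): union of the negative examples that contain t.
    v_minus = frozenset().union(*[ex for ex in negatives if t <= ex])
    return v_plus - v_minus


POSITIVES = [frozenset(r.split()) for r in (
    "Low Blond Bleu", "Low Brown Bleu", "Tall Brown Embrown")]
NEGATIVES = [frozenset(r.split()) for r in (
    "Tall Blond Embrown", "Tall Brown Bleu", "Low Blond Embrown",
    "Tall Red Bleu", "Tall Blond Bleu")]

# "Low" alone is not a test; adding "Bleu" or "Brown" turns it into one.
print(sorted(essential_values({"Low"}, POSITIVES, NEGATIVES)))
```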

The Inductive Diagnostic Rule 

The inductive diagnostic rule is used to get a collection of values $t_{q+1} = (A_1, A_2, \ldots, A_{q+1})$ from a collection $t_q = (A_1, A_2, \ldots, A_q)$ such that $t_q$ is not a test, but $t_{q+1}$ is a test for a given set of positive examples.

We extend $t_q$ by choosing values that appear simultaneously with it in the examples of $R(+)$, and do not appear in any example of $R(-)$. These values are said to be essential.

Definition 8. Let $t$ be a collection of values that is a test for a given set of positive examples. We say that the value $A$ in $t$ is essential if $(t \setminus A)$ is not a test for a given set of positive examples.

Generally, we are interested in finding the maximal subset $sbmax(t) \subset t$ such that $t$ is a test, but $sbmax(t)$ is not a test for a given set of positive examples. Then $sbmin(t) = t \setminus sbmax(t)$ is the minimal set of essential values in $t$.

This inductive rule generates diagnostic rules 
of the first type. It is based on the inductive method 
of only distinction. We see that the diagnostic 
rules of the first type obtained with the use of 
inductive diagnostic rules are used immediately 
in the process of good tests construction. 

An analogous rule is defined in Michalski 
(1983) and Michalski and Larson (1978). If a 
newly presented training example contradicts 
an already constructed concept description, the 
specialization rule is applied to generate a new 
consistent concept description. 

The Dual Inductive Diagnostic Rule 

The dual inductive diagnostic rule is used to get a collection of indices $s_{q-1} = (i_1, i_2, \ldots, i_{q-1})$ from a collection $s_q = (i_1, i_2, \ldots, i_q)$ such that $t(s_{q-1})$ is a test, but $t(s_q)$ is not a test for a given set of positive examples. This rule uses a method for choosing indices admissible for deleting from $s_q$. By analogy with an essential value, we define an essential example.

Definition 9. Let $s$ be a subset of indices of positive examples; assume also that $t(s)$ is not a test. The example $t_j$, $j \in s$, is said to be an essential one if $t(s \setminus j)$ proves to be a test for a given set of positive examples.

Generally, we are interested in finding the maximal subset $sbmax(s) \subset s$ such that $t(s)$ is not a test, but $t' = t(sbmax(s))$ is a test for a given set of positive examples. Then $sbmin(s) = s \setminus sbmax(s)$ is the minimal set of indices of essential examples in $s$.

The dual inductive diagnostic rule is used for inferring compatibility rules of the first type. The number of indices in $sbmax(s)$ can be understood as a measure of “carrying-out” for an acquired rule related to $sbmax(s)$, namely, $t(sbmax(s)) \to k(R(+))$ frequently, where $k(R(+))$ is the name of the set $R(+)$.

Assume $s^*$ is a collection of indices of positive examples such that $t(s^*)$ is not a test. Next we describe the procedure with the use of which a quasi-maximal subset $qsbmax(s^*) \subset s^*$ is obtained such that $t(qsbmax(s^*))$ is a test for a given set of positive examples.

We begin with the first index $i_1$ of $s^*$, then we take the next index $i_2$ of $s^*$ and evaluate the function to_be_test($t(\{i_1, i_2\})$). If the value of this function is “true,” then we take the next index $i_3$ of $s^*$ and evaluate the function to_be_test($t(\{i_1, i_2, i_3\})$). If the value of the function to_be_test($t(\{i_1, i_2\})$) is “false,” then the index $i_2$ of $s^*$ is skipped and the function to_be_test($t(\{i_1, i_3\})$) is evaluated. We continue this process until we achieve the last index of $s^*$.
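The procedure just described amounts to a single greedy pass over $s^*$. The following Python sketch is an illustration under stated assumptions: the helper names `s_of`, `t_of`, and `qsbmax`, the encoding of the examples, and the choice of the Table 4 data with class 1 as the positive class are all ours.

```python
ROWS = {
    1: "Low Blond Bleu", 2: "Low Brown Bleu", 3: "Tall Brown Embrown",
    4: "Tall Blond Embrown", 5: "Tall Brown Bleu", 6: "Low Blond Embrown",
    7: "Tall Red Bleu", 8: "Tall Blond Bleu",
}
EXAMPLES = {i: frozenset(row.split()) for i, row in ROWS.items()}
S_PLUS = frozenset({1, 2, 3})


def s_of(t):
    """s(t): indices of the examples containing every value of t."""
    t = frozenset(t)
    return frozenset(i for i, ex in EXAMPLES.items() if t <= ex)


def t_of(s):
    """t(s): values shared by all examples with indices in s."""
    rows = [EXAMPLES[i] for i in s]
    return frozenset.intersection(*rows) if rows else frozenset()


def to_be_test(t):
    covered = s_of(t)
    return covered & S_PLUS == covered


def qsbmax(s_star):
    """Greedy pass: keep an index only if the intersection stays a test."""
    indices = list(s_star)
    kept = indices[:1]  # the procedure starts from the first index
    for i in indices[1:]:
        if to_be_test(t_of(kept + [i])):
            kept.append(i)  # t(kept + [i]) is still a test
        # otherwise the index i is skipped
    return kept


print(qsbmax([1, 2, 3]))  # example 3 shares no value with examples 1 and 2
print(qsbmax([3, 1, 2]))  # the result depends on the order of the indices
```

The dependence on the order of the indices is the reason the result is only quasi-maximal.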

The dual inductive diagnostic rule is based on 
the inductive method of only distinction. 

We see that the compatibility rules of the first 
type, obtained with the use of dual inductive di- 
agnostic rule, are used immediately in the process 
of good tests construction. 

The rules for constructing diagnostic tests as 
elements of dual lattice generate logical rules of 
the first type, as shown in Table 3. 


THE DECOMPOSITION OF 
INFERRING GOOD DIAGNOSTIC 
TESTS INTO SUBTASKS 

To transform the inference of good diagnostic tests into an incremental process, we introduce two kinds of subtasks (Naidenova & Ermakov, 2001):

For a given set of positive examples: 

1. Given a positive example t, find all GMRTs 
contained in t. 

2. Given a nonempty collection of values X 
(maybe only one value) such that it is not a 
test, find all GMRTs containing X. 

The subtask of the first kind. We introduce the concept of an example’s projection proj(R)[t] of a given positive example $t$ on a given set $R(+)$ of positive examples. The proj(R)[t] is the set $Z = \{z: (z \text{ is a nonempty intersection of } t \text{ and } t') \;\&\; (t' \in R(+)) \;\&\; (z \text{ is a test for a given class of positive examples})\}$.

If the proj(R)[t] is not empty and contains more than one element, then it is a subtask for inferring all GMRTs that are in $t$. If the projection contains one and only one element equal to $t$, then $t$ is a GMRT.
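As an illustration, the projection of an example can be computed as follows. This Python sketch rests on assumptions: the function name `example_projection`, the helper `s_of`, and the encoding of the data (the examples of Table 4 of this chapter, class 1 positive) are not taken from the original.

```python
ROWS = {
    1: "Low Blond Bleu", 2: "Low Brown Bleu", 3: "Tall Brown Embrown",
    4: "Tall Blond Embrown", 5: "Tall Brown Bleu", 6: "Low Blond Embrown",
    7: "Tall Red Bleu", 8: "Tall Blond Bleu",
}
EXAMPLES = {i: frozenset(row.split()) for i, row in ROWS.items()}
S_PLUS = frozenset({1, 2, 3})


def s_of(t):
    """s(t): indices of the examples containing every value of t."""
    return frozenset(i for i, ex in EXAMPLES.items() if t <= ex)


def to_be_test(t):
    covered = s_of(t)
    return covered & S_PLUS == covered


def example_projection(t):
    """proj(R)[t]: nonempty intersections of t with positive examples that are tests."""
    t = frozenset(t)
    projection = set()
    for i in S_PLUS:
        z = t & EXAMPLES[i]
        if z and to_be_test(z):
            projection.add(z)
    return projection


# Example 1 yields two elements, so it defines a subtask.
print(example_projection(EXAMPLES[1]))
# Example 3 yields only itself, so by the criterion above it is a GMRT.
print(example_projection(EXAMPLES[3]))
```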


Table 3. Deductive rules of the first type obtained with the use of inductive rules for inferring diagnostic tests

| Inductive rules | Action | Inferring deductive rules of the first type |
|---|---|---|
| Generalization rule | Extending s (narrowing t) | Implications |
| Specification rule | Extending t (narrowing s) | Implications |
| Inductive diagnostic rule | Searching for essential values | Diagnostic rules |
| Dual inductive diagnostic rule | Searching for essential examples | Compatibility rules (approximate implications) |

The subtask of the second kind. We introduce the concept of an attributive projection proj(R)[A] of a given value $A$ on a given set $R(+)$ of positive examples.

The projection $proj(R)[A] = \{t: (t \in R(+)) \;\&\; (A \text{ appears in } t)\}$. Another way to define this projection is: $proj(R)[A] = \{t_i: i \in (s(A) \cap S(+))\}$. If the attributive projection is not empty and contains more than one element, then it is a subtask of inferring all GMRTs containing a given value $A$. If $A$ appears in one and only one example, then $A$ does not belong to any GMRT different from this example.

Forming the projection of $A$ makes sense if $A$ is not a test and the intersection of all positive examples in which $A$ appears is not a test either, that is, $s(A) \not\subseteq S(+)$ and $t' = t(s(A) \cap S(+))$ is not a test for a given set of positive examples.
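The attributive projection is simply the selection of the positive examples in which the value appears. A minimal Python sketch follows; the function name `attributive_projection` and the data encoding (the examples of Table 4 of this chapter, class 1 positive) are assumptions for illustration.

```python
ROWS = {
    1: "Low Blond Bleu", 2: "Low Brown Bleu", 3: "Tall Brown Embrown",
    4: "Tall Blond Embrown", 5: "Tall Brown Bleu", 6: "Low Blond Embrown",
    7: "Tall Red Bleu", 8: "Tall Blond Bleu",
}
EXAMPLES = {i: frozenset(row.split()) for i, row in ROWS.items()}
S_PLUS = frozenset({1, 2, 3})


def attributive_projection(value):
    """proj(R)[A]: the positive examples t_i with i in s(A) & S(+)."""
    s_a = {i for i, ex in EXAMPLES.items() if value in ex}  # s(A)
    return {i: EXAMPLES[i] for i in sorted(s_a & S_PLUS)}


# The value "Low" appears in the positive examples 1 and 2.
print(attributive_projection("Low"))
```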

Decomposing the inference of good classification tests into subtasks of the first and second kinds implies introducing a set of special rules to realize the following operations: choosing an example (value) for a subtask, forming a subtask, deleting values or examples from a subtask, and some other rules controlling the process of inferring good tests.

The following theorem gives the foundation 
for reducing projections both of the first and the 
second kind. The proof of this theorem can be 
found in Naidenova et al. (1995b). 

THEOREM 1 

Let $A$ be a value from $T$, $X$ be a maximally redundant test for a given set $R(+)$ of positive examples, and $s(A) \subseteq s(X)$. Then $A$ does not belong to any maximally redundant good test for $R(+)$ different from $X$.

It is convenient to choose essential values in 
an example and essential examples in a projection 
for the decomposition of inferring GMRTs into 
the subtasks of the first or second kind. 


Table 4. Example 2 of data classification

| Index of example | Height | Color of hair | Color of eyes | Class |
|---|---|---|---|---|
| 1 | Low | Blond | Bleu | 1 |
| 2 | Low | Brown | Bleu | 1 |
| 3 | Tall | Brown | Embrown | 1 |
| 4 | Tall | Blond | Embrown | 2 |
| 5 | Tall | Brown | Bleu | 2 |
| 6 | Low | Blond | Embrown | 2 |
| 7 | Tall | Red | Bleu | 2 |
| 8 | Tall | Blond | Bleu | 2 |


Table 5. The subtask for the value “Low”

| Index of example | Height | Color of hair | Color of eyes | Class |
|---|---|---|---|---|
| 1 | Low | Blond | Bleu | 1 |
| 2 | Low | Brown | Bleu | 1 |
| 6 | Low | Blond | Embrown | 2 |

We give a small example of inferring all the GMRTs for the instances of class 1 presented in Table 4.

In Table 4, we have: $S(+) = \{1, 2, 3\}$, $s(\text{Low}) \to \{1, 2, 6\}$, $s(\text{Brown}) \to \{2, 3, 5\}$, $s(\text{Bleu}) \to \{1, 2, 5, 7, 8\}$, $s(\text{Tall}) \to \{3, 4, 5, 7, 8\}$, $s(\text{Embrown}) \to \{3, 4, 6\}$, and $s(\text{Blond}) \to \{1, 4, 6, 8\}$.

We discover that the value “Low” is essential 
in lines 1 and 2. Then it is convenient to form 
the subtask of the second kind for this value as 
shown in Table 5. 

In Table 5, we have: $S(+) = \{1, 2\}$, $s(\text{Low}) \to \{1, 2, 6\}$, $s(\text{Brown}) \to \{2\}$, $s(\text{Bleu}) \to \{1, 2\}$, and $s(\text{Blond}) \to \{1, 6\}$.

We have: $s(\text{Bleu}) = \{1, 2\} \subseteq S(+)$. It means that the collection of values “Low Bleu” is a test for class 1. Analogously, for the value “Brown,” we have: $s(\text{Brown}) = \{2\} \subseteq S(+)$. It means that the collection of values “Low Brown” is a test for class 1 but not a good one because $s(\text{Brown}) = \{2\} \subset s(\text{Bleu})$.

It is clear that these values cannot belong to any test different from the tests already obtained. We delete “Brown” and “Bleu” from further consideration in this subtask. But after deleting these values, lines 1 and 2 are not tests for class 1. Hence, the subtask is over.

Return to the main problem. Now we can de- 
lete the value “Low” from further consideration 
because we have gotten all good tests containing 
this value for class 1. But we know that the value 
“Low” is essential in lines 1 and 2; this fact means 
that these lines are not tests for class 1 after delet- 
ing this value. 

The following step may be the inference of 
all irredundant tests contained in line 3 (cover- 
ing only one line 3) for class 1. In our case, the 
collection of values “Brown Embrown” is a GIRT 
contained in line 3. 

A recursive procedure, based on using at- 
tributive subtasks for inferring GMRTs, has been 
described in Naidenova et al. (1995b). In the fol- 
lowing part of this chapter, we give an algorithm 
based on the subtasks of the first kind combined 


with searching essential examples. This algorithm 
is used only for inferring GMRTs. 

An Algorithm for Inferring GMRTs 
with the use of the Subtask of the 
First Kind 

The algorithm DIAGaRa is the basic recursive 
algorithm for solving a subtask of the first kind. 

The initial information for the algorithm of finding all the GMRTs contained in a positive example is the projection of this example on the current set $R(+)$. Essentially the projection is simply a subset of examples defined on a certain restricted subset $t^*$ of values. Let $s^*$ be the subset of indices of positive examples producing the projection.

It is useful to introduce the characteristic $W(t)$ of any collection $t$ of values, named the weight of $t$ in the projection: $W(t) = \|s(t) \cap s^*\|$ is the number of positive examples of the projection containing $t$. Let WMIN be the minimal permissible value of the weight.

Let STGOOD be the partially ordered set of 
elements s satisfying the condition that t(s) is a 
good test for R(+). 

The basic algorithm consists of applying the 
sequence of the following steps: 

• Step 1: Check whether the intersection of all the elements of the projection is a test and, if so, then $s^*$ is stored in STGOOD if $s^*$ corresponds to a good test at the current step; in this case, the subtask is over. Otherwise the next step is performed (we use the function to_be_test(t): if $s(t) \cap S(+) = s(t)$ ($s(t) \subseteq S(+)$) then true else false).

• Step 2: For each value $A$ in the projection, the set $splus(A) = \{s^* \cap s(A)\}$ and the weight $W(A) = \|splus(A)\|$ are determined, and if the weight is less than the minimum permissible weight WMIN, then the value $A$ is deleted from the projection. We can also delete the value $A$ if $W(A)$ is equal to WMIN and $t(splus(A))$ is not a test; in this case $A$ will not appear in a maximally redundant test $t$ with $W(t)$ equal to or greater than WMIN.

• Step 3: The generalization operation is performed: $t' = t(splus(A))$, $A \in t^*$; if $t'$ is a test, then the value $A$ is deleted from the projection and $splus(A)$ is stored in STGOOD if $splus(A)$ corresponds to a good test at the current step.

• Step 4: The value $A$ can be deleted from the projection if $splus(A) \subseteq s'$ for some $s' \in STGOOD$.

• Step 5: If at least one value has been deleted 
from the projection, then the reduction of 
the projection is necessary. The reduction 
consists of deleting the elements of projec- 
tion that are not tests (as a result of previous 
eliminating values). If, under reduction, at 
least one element has been deleted from the 
projection, then Step 2, Step 3, Step 4, and 
Step 5 are repeated. 

• Step 6: Check whether the subtask is over 
or not. The subtask is over when either the 
projection is empty or the intersection of all 
elements of the projection corresponds to a 
test (see Step 1). If the subtask is not over, 
then the choice of an essential example in 
this projection is performed and the new 
subtask is formed with the use of this es- 
sential example. The new subsets s* and 
t* are constructed and the basic algorithm 
runs recursively. The important part of 
the basic algorithm is how to form the set 
STGOOD. 
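To make the role of the weight concrete, the first part of Step 2 (computing splus(A) and W(A) and deleting the values whose weight is below WMIN) can be sketched in Python. This is not the algorithm DIAGaRa itself, only an illustration of that single pruning rule; the function name `prune_by_weight`, the data encoding, and the use of the Table 4 examples are assumptions.

```python
ROWS = {
    1: "Low Blond Bleu", 2: "Low Brown Bleu", 3: "Tall Brown Embrown",
    4: "Tall Blond Embrown", 5: "Tall Brown Bleu", 6: "Low Blond Embrown",
    7: "Tall Red Bleu", 8: "Tall Blond Bleu",
}
EXAMPLES = {i: frozenset(row.split()) for i, row in ROWS.items()}


def prune_by_weight(s_star, wmin):
    """Return {A: splus(A)} for the values A of the projection with W(A) >= wmin."""
    s_star = frozenset(s_star)
    # All values occurring in the examples of the projection.
    values = set().union(*(EXAMPLES[i] for i in s_star))
    # splus(A): indices of the projection's examples that contain A.
    splus = {a: frozenset(i for i in s_star if a in EXAMPLES[i]) for a in values}
    # W(A) is the size of splus(A); values below the threshold are dropped.
    return {a: sp for a, sp in splus.items() if len(sp) >= wmin}


# With s* = {1, 2, 3} and WMIN = 2, the values "Tall", "Blond" and "Embrown"
# occur in only one example of the projection and are deleted.
print(prune_by_weight({1, 2, 3}, 2))
```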

We give in the Appendix an example of the 
work of the algorithm DIAGaRa. 

An approach for forming the set STGOOD. Let $L(S)$ be the set of all subsets of the set $S$. $L(S)$ is the set lattice (Rasiova, 1974). The ordering determined in the set lattice coincides with the set-theoretical inclusion. It will be said that subset $s_1$ is absorbed by subset $s_2$, that is, $s_1 \leq s_2$, if and only if the inclusion relation holds between them, that is, $s_1 \subseteq s_2$. Under formation of STGOOD, a collection $s$ of indices is stored in STGOOD if and only if it is not absorbed by any collection of this set. It is necessary also to delete from STGOOD all the collections of indices that are absorbed by $s$ if $s$ is stored in STGOOD. Thus, when the algorithm is over, the set STGOOD contains all the collections of indices that correspond to GMRTs and only such collections. Essentially, the process of forming STGOOD is an incremental procedure of finding all maximal elements of a partially ordered set. The set TGOOD of all the GMRTs is obtained as follows: $TGOOD = \{t: t = t(s), (\forall s)(s \in STGOOD)\}$.
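The absorption rule can be sketched as a small update function. The Python code below is illustrative: the function name `store_in_stgood` and the representation of STGOOD as a set of frozensets are assumptions.

```python
def store_in_stgood(stgood, s):
    """Insert s unless it is absorbed; drop the collections that s absorbs."""
    s = frozenset(s)
    if any(s <= kept for kept in stgood):
        return set(stgood)  # s is absorbed by a stored collection
    # Keep only the stored collections that are not subsets of s, then add s.
    return {kept for kept in stgood if not kept <= s} | {s}


stgood = set()
for candidate in ({2, 3, 4, 7}, {2, 3}, {1, 2, 12, 14}, {2, 3, 4, 7, 8}):
    stgood = store_in_stgood(stgood, candidate)

# Only the maximal collections remain.
print(sorted(sorted(c) for c in stgood))
```

At every moment the stored collections are pairwise incomparable under inclusion, which matches the description of the procedure as an incremental search for the maximal elements of a partially ordered set.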

An Approach to Incremental 
Inferring Good Diagnostic Tests 

Incremental learning is necessary when a new 
portion of observations or examples becomes 
available over time. Suppose that each new 
example comes with the indication of its class 
membership. The following actions are necessary 
with the arrival of a new example: 

• Check whether it is possible to perform 
generalization of some existing GMRTs for 
the class to which the new example belongs 
(class of positive examples), that is, whether 
it is possible to extend the set of examples 
covered by some existing GMRTs or not. 

• Infer all the GMRTs contained in the new 
example. 

• Check the validity of the existing GMRTs for negative examples and, if necessary, modify the tests that are not valid (a test for negative examples is not valid if it is included in a positive example, that is, if it accepts an example of the positive class).

Thus the process of inferring all the GMRTs is divided into subtasks that conform to three acts of reasoning:

• Pattern recognition, or using already known rules (tests) for determining the class membership of a new positive example, and generalization of the rules that recognize it correctly (deductive reasoning and increasing the power of already existing inductive knowledge).

• Inferring new rules (tests) that are generated by a new positive example (inductive reasoning to obtain new knowledge).

• Correcting rules (tests) of alternative (negative) classes that accept a new positive example, that is, rules that do not distinguish the new positive example from some negative examples (deductive and inductive diagnostic reasoning to modify knowledge).

The first act reveals the known rules satisfied by a new example; the induction base of these rules can then be enlarged.

The second act can be reduced to the subtask 
of the first kind. 

The third act can be reduced either to the inductive diagnostic rule and the subtask of the first kind or to the subtask of the second kind.

CONCLUSION 

This work is an attempt to transform a large class 
of machine-learning tasks into a commonsense 
reasoning process based on using well-known 
deduction and induction logical rules. 

For this goal, we have chosen the task of in- 
ferring good classification (diagnostic) tests for 
a given partitioning on a given training set of 
examples because a lot of well-known machine- 
learning problems, such as inferring functional, 
implicative, and associative dependencies from 
data, are reduced to this task. 

We proposed a unified model for combining 
inductive reasoning with deductive reasoning in 
the framework of inferring and using implicative 


logical rules. The key concept of our approach is 
the concept of a good diagnostic test. We define a 
good diagnostic test as the best approximation of a 
given classification on a given set of examples. 

We have used the lattice theory as the math- 
ematical model for constructing good classifica- 
tion tests. We define a diagnostic test as a dual 
object, that is, as an element of the concept lattice 
introduced in the formal concept analysis. 

The links between dual elements of concept 
lattice reflect both inclusion relations between 
concepts (structural knowledge) and implicative 
relations between concept descriptions (deductive 
knowledge). 

Inferring the chains of lattice elements ordered 
by the inclusion relation lies in the foundation 
of generating all types of diagnostic tests. We 
considered four variants of inductive transition 
from one element of a chain to its nearest element 
in the lattice. We have constructed the special 
rules for realizing these inductive transitions: 
the generalization rule, the specification rule, the 
inductive diagnostic rule, and the dual inductive 
diagnostic rule. 

We have divided commonsense reasoning rules into two classes: rules of the first type and rules of the second type. The rules of the first type are represented with the use of implicative logical statements. The rules of the second type, or reasoning rules (deductive and inductive), are rules with the help of which rules of the first type are used, updated, and inferred from data. The deductive reasoning rules of the second type are modus ponens, modus ponendo tollens, modus tollendo ponens, and modus tollens. The inductive reasoning rules of the second type are the following ones: the method of only similarity, the method of only distinction, the joint method of similarity-distinction, and some others. The analysis of the inference for lattice construction demonstrates that this inference engages both inductive and deductive reasoning rules of the second type. During the lattice construction, the rules of the first type (implications, interdictions, rules of compatibility) are generated and used immediately.

We have introduced the decomposition of inferring good tests for a given set of positive examples into operations and subtasks that are in accordance with human commonsense reasoning operations. This decomposition makes it possible, in principle, to transform the process of inferring good tests into a “step by step” commonsense reasoning process.

We have also given the algorithm DIAGaRa for inferring good maximally redundant tests, and an approach to incrementally inferring good diagnostic tests.

Appendix

An example of using algorithm DIAGaRa.

The data to be processed are in Table 6 (the set of positive examples) and in Table 7 (the set of negative examples).

We begin with $s^* = S(+) = \{\{1\}, \{2\}, \ldots, \{14\}\}$, $t^* = T = \{A_1, A_2, \ldots, A_{26}\}$, $SPLUS = \{splus(A_i): A_i \in t^*\}$ (see SPLUS in Table 8). In Tables 8, 9, $A_*$ denotes the collection of values $\{A_8, A_9\}$ and $A_+$ denotes the collection of values $\{A_{14}, A_{15}\}$ because $splus(A_8) = splus(A_9)$ and $splus(A_{14}) = splus(A_{15})$.

We use the algorithm DIAGaRa for inferring all the GMRTs having a weight equal to or greater than WMIN = 4 for the training set of the positive examples represented in Table 6.

Please observe that $splus(A_{12}) = \{2, 3, 4, 7\}$ and $t(\{2, 3, 4, 7\})$ is a test; therefore, $A_{12}$ is deleted from $t^*$ and $splus(A_{12})$ is inserted into STGOOD.

Then $W(A_*)$, $W(A_{13})$, and $W(A_{16})$ are less than WMIN; hence, we can delete $A_*$, $A_{13}$, and $A_{16}$ from $t^*$. Now $t_{10}$ is not a test and can be deleted.

Table 6. The set of positive examples R(+)

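| Index of example | R(+) |
|---|---|
| 1 | $A_1\ A_2\ A_5\ A_6\ A_{21}\ A_{23}\ A_{24}\ A_{26}$ |
| 2 | $A_4\ A_7\ A_8\ A_9\ A_{12}\ A_{14}\ A_{15}\ A_{22}\ A_{23}\ A_{24}\ A_{26}$ |
| 3 | $A_3\ A_4\ A_7\ A_{12}\ A_{13}\ A_{14}\ A_{15}\ A_{18}\ A_{19}\ A_{24}\ A_{26}$ |
| 4 | $A_1\ A_4\ A_5\ A_6\ A_7\ A_{12}\ A_{14}\ A_{15}\ A_{16}\ A_{20}\ A_{21}\ A_{24}\ A_{26}$ |
| 5 | $A_2\ A_6\ A_{23}\ A_{24}$ |
| 6 | $A_7\ A_{20}\ A_{21}\ A_{26}$ |
| 7 | $A_3\ A_4\ A_5\ A_6\ A_{12}\ A_{14}\ A_{15}\ A_{20}\ A_{22}\ A_{24}\ A_{26}$ |
| 8 | $A_3\ A_6\ A_7\ A_8\ A_9\ A_{13}\ A_{14}\ A_{15}\ A_{19}\ A_{20}\ A_{21}\ A_{22}$ |
| 9 | $A_{16}\ A_{18}\ A_{19}\ A_{20}\ A_{21}\ A_{22}\ A_{26}$ |
| 10 | $A_2\ A_3\ A_4\ A_5\ A_6\ A_8\ A_9\ A_{13}\ A_{18}\ A_{20}\ A_{21}\ A_{26}$ |
| 11 | $A_1\ A_2\ A_3\ A_7\ A_{19}\ A_{20}\ A_{21}\ A_{22}\ A_{26}$ |
| 12 | $A_2\ A_3\ A_{16}\ A_{20}\ A_{21}\ A_{23}\ A_{24}\ A_{26}$ |
| 13 | $A_1\ A_4\ A_{18}\ A_{19}\ A_{23}\ A_{26}$ |
| 14 | $A_{23}\ A_{24}\ A_{26}$ |

Table 7. The set of negative examples R(-)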


Index 

of example 

R(-) 

Index 

ol example 

R(-) 

15 

AAAs A 23 A 24 

32 

AAAAAAsA. 

16 

AAAAsAb 

33 

AAAAAA.A.A, 

17 

A 1 A 21 A 22 A 24 A 26 

34 

A A A As Ac Ai A 2 A 3 As 

18 

A 1 ^ A S As A 13 A 16 

35 

A A A A A A A A « As 

19 

A 2 A 6 A 7 A 9 A 21 A 23 

36 

A A AAA As As As 

20 

A 10 A 19 A 20 A 21 A 22 A 24 

37 

AAAAAAAA 2 A 4 A 5 A. 

21 

A 1 A 0 A 1 A 2 A 3 A 4 

38 

AAAAAAAA 2 A 3 A. 

22 

AAAAAAs 

39 

A i A A A A A A 14 As A 9 Ao A 3 As 

23 

A 2 A 6 A 8 A 9 A 14 A 15 A 16 

40 

A A A A A A A 12 A 13 A 14 A is As 

24 

A A A A A A As 

41 

A A A A A A A A 2 As A « A is As 

25 

A 7 A 13 A 19 A 20 A 22 A 26 

42 

A i A A A A A A 12 As A 18 A i9 Ao Ai As 

26 

A l A 2 A 3 A 6 A 7 A le 

43 

A A A A A A A 12 A 13 A 14 A is As 

27 

AAAAAAaAs 

44 

A A A A A A A i2 As a m As As As 

28 

AAAA 3 A 9 A 1 

45 

a i A A A A A A A A A As A 4 As 

29 

A 1 A 4 A 5 A 6 A 7 A 8 A 13 A 16 

46 

A A A A A A A 12 A 13 A 14 As As As A 24 

30 

AAAAAbAsAsA. 

47 

AAAAAAAAAsAsAsAaAs 

31 

A 1 A A A A 14 A 15 A 16 A 26 

48 

AAAA 2 A i4 As As 


Table 8. The set SPLUS of the collection splus(A) for all A in Tables 6 and 7 


SPLUS = {splus(A): s(A) n s(+), A e T}: 

splus(A ,) -> {2,8,10} 
splus(A u ) S {3,8,10} 
splus(A m ) -> {4,9,12} 
splus(A t ) — » {1,4,11,13} 
splus(A s ) — » {1,4,7,10} 
splus(A u ) -> {2, 3, 4,7} 
splus(A m ) {3,9,10,13} 

splus(A,) — » {1,5,10,11,12} 
splus(AJ -> {2, 3, 4, 7, 8} 
splus(A a ) {3,8,9,11,13} 

splus(A 22 ) -> {2,7,8,9,11} 

splus(A 23 ) -> {1,2,5,12,13,14} 

splus(A 3 ) -> {3,7,8,10,11,12} 

splus(A 4 ) -> {2,3,4,7,10,13} 

splus(A 6 )-> {1,4,5,7,8,10} 

splus(A 7 ) -> {2,3,4,6,8,11} 

splus(A 24 ) {1,2,3,4,5,7,12,14} 

splus(AJ {4,6,7,8,9,10,11,12} 

splus(A 2I ) {1,4,6,8,9,10,11,12} 

splus(A 26 ) {1,2,3,4,6,7,9,10,11,12,13,14} 
After modifying splus(A) for A 5 , A m , A v A 3 ,
A 4 , A g , A 20 , A 21 , and A 2g we find that W(A 5 ) = 3;
therefore, A 5 is deleted from t*.

Then W(A lg ) turns out to be less than WMIN
and we delete A lg ; this implies deleting t 13 . Next
we modify splus(A) for A t , A ig , A 23 , A 4 , A 26 and
find that splus(A 4 ) = {2, 3, 4, 7}. A 4 is deleted from
t*. Finally, W(A { ) turns out to be less than WMIN
and we delete A r

We can delete also the values A 2 , A ig because 
W(A 2 ), W(A ig ) = 4, t(splus(A 2 )), t(splus(A 19 )) are not 
tests and, therefore, these values will not appear 
in a maximally redundant test t with W(t) equal 
to or greater than 4. 

After deleting these values, we can delete 
the examples f 9 , t 5 because A ig is essential in t g , 
and A 2 is essential in t Next we can observe 
that splus(A 23 ) = {1,2,12,14} and t({1,2,12,14}) is
a test; thus, A 23 is deleted from t* and splus(A 23 )
is inserted into STGOOD. We can delete the 
values A 22 and A g because W(A 22 ) and W(A f .) are 
now equal to 4, t(splus(A 22 )) and t(splus(AJ) are 
not tests, and these values will not appear in a 
maximally redundant test with weight equal to 
or greater than 4. Now t 14 and t 1 are not tests and 
can be deleted. 

Choose t 12 as a subtask because t(splus(A 21 )/{12})
and t(splus(A 24 )/{12}) will be tests. By resolving
this subtask, we find that t 12 does not produce a
new test. We delete it. Then splus(A 21 ) is equal
to {4,6,8,11}, t({4,6,8,11}) is a test; thus, A 21 is
deleted from t* and splus(A 21 ) is inserted into
STGOOD. We can also delete the value A n „ because
t(splus(A 2 J) is a GMRT already obtained.


Table 9. The Sets STGOOD and TGOOD for the 
Examples of Tables 6 and 7. 



STGOOD 

TGOOD 

1 

{ 2 , 3 , 4 , 7 } 

A 4 A i2 A + AuAe 

2 

{ 1 , 2 , 12 , 14 } 

A 23 A 24 A 26 

3 

{ 4 , 6 , 8 , 11 } 

AAoAl 


We can delete the value A 3 because W(A 3 ) is
now equal to 4, t(splus(A 3 )) is not a test, and this
value will not appear in a maximally redundant
test with weight equal to or greater than 4. We
can delete t 6 because now this example is not a
test. Then we can delete the value A 20 because
t(splus(A 20 )) is a GMRT already obtained.

These deletions imply that all of the remaining 
rows t 2 , t 3 , t 4 , t 7 , t g , and t n are not tests. 

The list of the GMRTs with the weight equal to 
or greater than WMIN = 4 is given in Table 9. 
ABSTRACT

The analysis of quality of services is an important issue for the planning and the management of many 
businesses. The ability to address the demands and the relevant needs of the customers of a given service 
is crucial to determine its success in a competitive environment. Many quantitative tools in the areas of 
statistics and mathematical modeling have been designed and applied to serve this purpose. Here we 
consider an application of a well-established statistical technique, the stated preference models (SP), 
to identify, from a sample of customers, significant weights to attribute to different aspects of the ser- 
vice provided; such aspects may additively compose an overall satisfaction index. In addition, such a 
weighting system is applied to a larger set of customers and a comparison is made between the overall 
satisfaction identified by the SP index and the overall satisfaction directly declared by the customers. 
Such a comparison is performed by two rule-based classification systems, decision trees, and the logic 
data miner Lsquare. The results of these two tools help in identifying the differences between the two 
measurements from the structural point of view, and provide an improved interpretation of the results. 
The application considered is related to the customers of a large Italian airport. 
INTRODUCTION

Although quality is recognized as a key tool in
the management of services, its measurement still
remains a fairly subjective concept. The range
of definitions used is vast and spreads from “the
conformity of the specific or requisites” through
to “the suitability for use” arriving at the ample
sphere of “client satisfaction” (Franceschini, 2001;
Negro, 1995). Many statistical and data analysis
techniques have been proposed to measure the 
effective and perceived quality of the customers 
of a given service. Despite such efforts, some 
aspects of the issue still remain unsettled and 
the decision maker is faced with a number of 
choices to make when he/she has to plan a qual- 
ity measurement campaign. In this chapter, we 
try to extend the range of tools usually deployed 
in this setting, integrating the results of consolidated
techniques for quality surveys, the Stated
Preference models (SP), with the application of
rule-based classification algorithms. Such algorithms
are used to analyze the results obtained
by SP and to compare them with the satisfaction 
level directly declared by the users of the service 
under study. Our intention is to show the appro- 
priateness of such advanced data analysis tools, 
typical of the area of data mining, to perform a 
deeper analysis of the survey data and to better 
understand the structure of the different methods 
available to measure service quality and customer 
satisfaction. The data considered for this application
is derived from a survey of airport customers
conducted at a large Italian airport, where some
of the variables have been appropriately coded, as 
part of the results obtained are to be considered 
confidential. The results presented are not to be 
considered for interpretation purposes. 

The chapter is organized as follows. The next 
section analyzes in more detail the issue of measuring
the quality of a service through interviews with
service users. The main techniques available are
briefly introduced and described. Next, we
explain, in greater detail, the main


concepts behind SP, how such types of surveys 
are built, and what statistical and inferential 
tools are typically used to put such models to 
work. Then, we describe some partial results 
obtained from the realization of an SP survey 
on airport users. Such results are used to infer 
a factor-weighting system for a larger customer 
satisfaction survey. The comparison between the 
quality index obtained by the SP model and the 
one detected directly in the survey is the topic 
of the last two sections of this chapter. In one we 
propose the use of decision trees to compare the 
classification models for both quality indices that 
are obtained from a set of explanatory variables; 
in the other, we use a logic-based data-mining 
system, Lsquare, to derive explanations that 
link, through logic formulas, the overall quality 
index, and the preference level attributed by the 
customers to five relevant factors. Finally, some 
conclusions are drawn. 


MEASURING SERVICE QUALITY 

In marketing literature, the study of service quality 
has focused on its evaluation by the customers. 
When a consumer is put in a central position as 
the final judge of the quality, the typical customer 
satisfaction survey (CSS) is based on the compi- 
lation of assessment by the clients regarding the 
diverse characteristics of the services through 
suitable scales, to which specific graduation tech- 
niques are applied (Edwards, 1957). Above all, 
customer satisfaction market research is used by 
means of questionnaires and verbal scales, which 
the people interviewed use to express judgement 
about the aspects that influence the quality of the 
said service. These scales are usually made up of 
five or seven levels pinpointed by adjectives, labels,
or graduated segments. In such a way, the person 
interviewed is able to agree or disagree with each 
item. Each individual identifies an association 
between his own feelings and one of the categories 
in the scale that is offered to him/her. The most 
common instrument for measuring service quality
is the Servqual scale, a method that takes inspiration
from the disconfirmation theory, based on the
difference between the quality perceived and that
expected by the client (Parasuraman, Zeithaml, &
Berry, 1988). Servqual is a two-part questionnaire
containing several statements: one part to measure
what the client would expect from a general firm
in the sector to which the service under examination
refers, the other to assess how the client has
perceived the service offered. Although Servqual
has been applied across a broad variety of service
contexts, it has been criticized on methodological
and psychometric grounds by many researchers;
the Servperf model (Cronin & Taylor, 1992), the
Evaluated Performance model (Teas, 1993), and the
Retail Service Quality Scale (Dabholkar, Thorpe,
& Rentz, 1996) are additional examples of how
the subject has been treated in the literature. According
to these approaches, the data are
analyzed through multivariate statistical
techniques such as factor analysis and hierarchical
and multidimensional models.

Often, the global service-quality index is 
simply computed as the average of the clients’ 
responses on the overall service evaluation and 
then, through the relationship among the latter 
and the judgements on each service dimension 
(attribute), the importance weights of the single 
service characteristics are calculated. Sometimes, 
the importance of each dimension is obtained by 
directly asking respondents to allocate a certain 
number of points across the dimensions. It should be
stressed that these procedures may lead to partial or
biased measures. With the aim of overcoming this
problem, we consider an SP survey combined with 
CSS. By doing so, we are able to get the relative 
importance measures of the attributes, jointly 
evaluated, that is, based on an explicit trade-off 
between attributes, and we use the latter infor- 
mation to calculate the service quality indicator 
(SQI in the following). 


STATED PREFERENCE MODELS 

SP methods refer to a family of techniques that
involve interviewing individuals about their
preferences regarding a set of different options in order to
estimate utility functions. The options are simply
descriptions of goods or services that
differ in the characteristics they hold.
They mainly deal with hypothetical situations 
made up ad hoc by the researcher. By their nature, 
SP methods require purpose-designed surveys 
for their collection of data. Such methods were 
originally developed in the marketing research 
field in the early 1970s, and have become widely 
used since 1978, with the objective of identifying 
the customers’ preferences structure for prod- 
ucts available or not yet available on the market. 
The flexibility of these techniques and the rich 
information that can be extracted allow their ap- 
plication also in transport, environmental, and 
medical fields. 

A preference can be expressed in three different
ways: respondents may rank the
options (no metric valuation); they may rate a
set of alternatives; or they may choose the best
scenario in a given set. The last is less informative
but easier and faster for individuals than
the other tasks, and it is the one that they perform
in reality, by comparing a set of situations and
selecting one. Furthermore, this method does not
require any assumptions to be made about order
or cardinality measurement (Louviere, 1988). We
therefore concentrate, in this chapter, on choice-based
conjoint analysis, whose seminal
paper was written by McFadden (1974).

The formation of a preference and the
decisional process are, however, two very
delicate aspects of the theory of behavior.
The huge complexity that stems from their
analysis requires a series of simplifying measures, as
well as knowledge of the theory of the process
that leads the individual to give certain
answers (Louviere, Hensher, & Swait, 2000).
The theoretical basis is represented by the microeconomic
theory of choice and by the random
utility theory (RUT). The first maintains that each
decision maker possesses a preference relation, $\succeq$,
over the range of possible choices that satisfies
the rationality axioms. Such rationality is guaranteed
by the completeness and transitivity properties
(Mas-Colell, Whinston, & Green, 1995), which
guarantee the representability of the structure
of the individual’s preferences through a mathematical
function U, called the utility function, which
has an ordinal worth. Considering two alternatives,
i and j (which can be goods or services), belonging
to a set of choices C, meaning a collection of
available alternatives from which the individual
is asked to choose, we get:

$$i \succeq j \iff U_i \geq U_j \qquad (1)$$

In this context, utility is defined as the capac- 
ity of the object in question (goods or service) to 
satisfy the needs and meet the preferences of the 
decision maker. The choices will be carried out 
in order to guarantee the highest level of utility 
possible. Utility maximization, as a decisional 
rule, implies that the alternative i would be chosen
if:

$$U_i > U_j, \quad \forall j \neq i \in C \qquad (2)$$

A first extension of the microeconomic theory 
for individual choice is suggested by Lancaster 
(1966). Here utility is defined in terms of different 
attributes. The decision, then, would directly stem 
from the utility that springs from the attributes 
and, consequently, the preference towards any 
certain product or service would only be indirect. 
This hypothesis allows one to represent the choice 
between alternatives as between attributes. 

An approach coherent with the above-mentioned
measures is RUT, originally proposed by
Thurstone (1927), by which the decision maker
has a perfect discrimination capacity, while the
analyst has incomplete information, mainly caused
by the impossibility of considering all the factors that
influence the preference of the individual. That
implies that utility is not known exactly
and must be treated as a random variable, made
up of a systematic component and a margin of
error. The utility of alternative i perceived by individual
q can be represented as the sum of a
systematic component and a random one:

$$U_{iq} = V_{iq} + \varepsilon_{iq} \qquad (3)$$

The systematic component is a function, linear
in its parameters, of the fundamental attributes:

$$V_{iq} = \beta' X_{iq} \qquad (4)$$

where $\beta$ is the vector of the coefficients associated
with the vector $X_{iq}$ of explanatory variables associated
with alternative i. The random component is then
included as it is envisaged that some factors that
influence the choice of the decision maker are not
measurable. Manski (1973) identified four sources
of randomness due to incomplete information:
important attributes not taken into consideration,
preferences not detected that differed between
individuals, measurement errors, and errors gone unnoticed.
In synthesis, it is assumed that the decision
maker is fully informed, has rational preferences,
can observe the alternatives with ease and without
cost, and chooses in a rewarding way that which
offers the greatest utility.

In the case of choice between two or more 
alternatives, equation (2) becomes: 

$$\forall j \neq i \in C, \quad U_{iq} > U_{jq} \iff (V_{iq} - V_{jq}) > (\varepsilon_{jq} - \varepsilon_{iq}) \qquad (5)$$

According to RUT, the analyst, not being able
to observe the difference on the right-hand side
of the last equation, is not able to indicate
deterministically when such an inequality is
valid and, therefore, turns to a probabilistic approach.
Then, the probability that the individual
q chooses the alternative i from the set of choices
C is given by:

$$P(i \mid C) = P[(\varepsilon_{jq} - \varepsilon_{iq}) < (V_{iq} - V_{jq})], \quad \forall j \neq i \qquad (6)$$

In order to calculate such a probability, it is 
enough to define the statistical distribution of the 
random term, and equation (6) could be rewritten 
as follows: 

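$$P(i \mid C) = \int_{\varepsilon} I[(\varepsilon_{jq} - \varepsilon_{iq}) < (V_{iq} - V_{jq})] \, f(\varepsilon_q) \, d\varepsilon_q, \quad \forall j \neq i \qquad (7)$$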

where $f(\varepsilon_q)$ is the density function of the random
vector $\varepsilon_q = (\varepsilon_{jq})_{j \in C}$, while $I[\cdot]$ is the indicator
function that assumes value 1 when the expression
in brackets is true, and 0 otherwise. The
probability that each random term difference $(\varepsilon_{jq} - \varepsilon_{iq})$ is
below the observed quantity $(V_{iq} - V_{jq})$ is none
other than a cumulative distribution that can be
rewritten in terms of a multidimensional integral
over the density of the unobserved portion of utility.
Different specifications of density, meaning
different assumptions about the distribution of
the error term, generate various discrete choice
models that can be used to analyze the gathered
choice data with the purpose of estimating the
$\beta$-parameters and calculating an SQI.
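
The role of the distributional assumption can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it draws independent standard Gumbel errors, counts how often alternative i attains the highest random utility (a Monte Carlo estimate of the integral over the indicator function), and compares the result with the closed-form logit share. The systematic utilities, the number of draws, and all function names are hypothetical.

```python
import math
import random


def gumbel(rng):
    """One draw from the standard Gumbel distribution (inverse-CDF method)."""
    u = max(rng.random(), 1e-300)  # guard against log(0)
    return -math.log(-math.log(u))


def simulated_choice_share(v, chosen, draws=100_000, seed=0):
    """Share of draws in which alternative `chosen` has the largest V + error."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        utilities = [vj + gumbel(rng) for vj in v]
        best = max(range(len(v)), key=utilities.__getitem__)
        if best == chosen:
            hits += 1
    return hits / draws


def logit_probability(v, chosen):
    """Closed-form logit probability exp(V_i) / sum_j exp(V_j)."""
    m = max(v)  # subtract the maximum for numerical stability
    expv = [math.exp(vj - m) for vj in v]
    return expv[chosen] / sum(expv)


# Hypothetical systematic utilities for three alternatives.
v = [0.5, 0.0, -0.3]
simulated = simulated_choice_share(v, chosen=0)
closed_form = logit_probability(v, chosen=0)
print(round(simulated, 3), round(closed_form, 3))
```

The two numbers should agree up to simulation noise, which is the practical meaning of saying that Gumbel errors lead to a closed-form choice probability.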

The most popular models are the Logit and
the Probit (see Train, 2003, for reference). The
first derives from the Gumbel distribution, the
latter from the Normal distribution. The latter
has the disadvantage of requiring a more
complex calculation; in fact, Probit models do not have
a closed-form expression for the integral in (7),
unlike the Logit models, which are much easier to use.

The most popular model is the multinomial
logit (ML), which is expressed as in equation (8).

Parameters of this model are estimated using
maximum likelihood, thus determining the set of
coefficients that, when inserted in the deterministic
part of the utility function, maximize the
joint probability across all the observations of the
choices actually made.
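
As a minimal sketch of what the likelihood criterion measures, the following Python fragment evaluates the logit log-likelihood of a set of observed choices for a given coefficient vector. It does not perform the maximization itself, and the observations, attribute vectors, and function names are hypothetical illustrations rather than the survey data.

```python
import math


def choice_probabilities(beta, alternatives):
    """Logit probabilities for one choice set; each alternative is an attribute vector."""
    v = [sum(b * x for b, x in zip(beta, alt)) for alt in alternatives]
    m = max(v)  # subtract the maximum for numerical stability
    expv = [math.exp(vi - m) for vi in v]
    total = sum(expv)
    return [e / total for e in expv]


def log_likelihood(beta, observations):
    """Sum over observations of the log-probability of the alternative actually chosen."""
    return sum(
        math.log(choice_probabilities(beta, alternatives)[chosen])
        for alternatives, chosen in observations
    )


# Hypothetical observations: (attribute vectors of the two alternatives, index chosen).
observations = [
    ([(1, 0), (0, 1)], 0),
    ([(1, 1), (0, 0)], 0),
    ([(0, 1), (1, 0)], 1),
]

ll_null = log_likelihood((0.0, 0.0), observations)
ll_fit = log_likelihood((1.0, 0.0), observations)
print(round(ll_null, 4), round(ll_fit, 4))
```

A coefficient vector that explains the observed choices better yields a higher (less negative) log-likelihood; a maximum likelihood routine searches for the vector that makes this value as large as possible.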

AN APPLICATION TO AIRPORT 
DATA 

In this section, we illustrate some results extracted 
from a survey conducted by the statistics department
of the Sapienza Università di Roma, for
one of the major Italian airports. The aim of the 
project was to identify the relative importance 
weights of the dimensions that still characterize 
their own customer satisfaction surveys, and use 
these measures to properly calculate an SQI; as 
stated, we want to link SP methods with CSS. The 
survey was conducted over a period of 9 months, 
and dealt with many aspects related to customer 
satisfaction and perceived service quality. 

While the main survey was taking place, 
we also conducted a parallel survey using the 
SP method. One of the first steps in designing 
a conjoint study is to fix a set of attributes and 
corresponding attribute levels that need to be 
evaluated by the respondents. The identification 
of relevant attributes is usually done through 


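Equation 8.

$$P(i \mid C) = \int_{-\infty}^{+\infty} \Big[ \prod_{j \neq i} \exp\big(-e^{-(\varepsilon_{iq} + V_{iq} - V_{jq})}\big) \Big] \, e^{-\varepsilon_{iq}} \exp\big(-e^{-\varepsilon_{iq}}\big) \, d\varepsilon_{iq} = \frac{e^{V_{iq}}}{\sum_{j \in C} e^{V_{jq}}} \qquad (8)$$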
literature reviews, focus group discussions, or
direct questioning. Given our objective, in the
actual study, attribute levels were simply selected
according to the items; verbal scales were used
for these to evaluate five different elements linked
with the airport service (e.g., airport environment,
waiting time, and others).

As anticipated, part of the information is 
confidential to the client of this study and thus, 
from now on, we confine the description of the 
five factors to the coded names F1, F2, F3, F4, F5
with six qualitative levels from Excellent to Very
poor, mapped into the values 6, 5, ..., 1, respectively.
At the same time, the results presented are
to be considered just an example of the methods 
adopted and by no means the true conclusions 
of the study. 

Statistical design theory is used to combine the 
levels of the attributes into a number of alternatives 
to be presented to respondents. The total number 
of options is a function both of the number of 
attributes and of the number of attribute levels. 
Here the total number of possible combinations 
was 7776; however, respondents can only evalu- 
ate a fairly limited number of options because 
of cognitive burden and fatigue. Through a formal
experimental design, we constructed three
choice sets per interview. One of these choice
sets had a control function; it was formed by two
fixed-design alternatives, one of which dominated
the other. Dominance refers to a
situation where one option is superior to another
on every attribute, so that no trade-offs are involved
in selecting the alternatives. In the final
analysis, we ignored all the interviews in which 
the agents failed to correctly answer the control 
choice exercise. To allow for a rich variation in the 
combination of attribute levels we used a block 
design and we prepared 200 different versions of 
the survey form. The interview was composed of 
two sections: in the first one, respondents were 
asked to fill in the form in Table 1; in the second 
one, the questions were put in a behavioral choice 
context and the interviewee had to make repeated 
choices between two alternatives. An example of 
a choice set is shown in Table 2. 
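
The size of the full factorial and the dominance check can be reproduced with a short Python sketch. It is only an illustration: the levels are coded 1 (Very poor) to 6 (Excellent) as in the text, while the function name `dominates` and the strict reading of "superior on every attribute" are assumptions made here.

```python
from itertools import product

FACTORS = ["F1", "F2", "F3", "F4", "F5"]
LEVELS = [1, 2, 3, 4, 5, 6]  # 1 = Very poor, ..., 6 = Excellent

# Full factorial: every combination of one level per factor.
profiles = list(product(LEVELS, repeat=len(FACTORS)))
n_profiles = len(profiles)


def dominates(a, b):
    """True if profile `a` is strictly better than `b` on every attribute."""
    return all(x > y for x, y in zip(a, b))


# The two alternatives of the example choice set shown in Table 2.
airport_a = (5, 3, 1, 2, 3)
airport_b = (4, 1, 5, 6, 4)

print(n_profiles)
print(dominates(airport_a, airport_b), dominates(airport_b, airport_a))
```

With five factors at six levels each, the count is 6^5 = 7776, the figure quoted above. Neither alternative of the Table 2 example dominates the other, so choosing between them involves a genuine trade-off, unlike the control choice set.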

Overall, 1,000 face-to-face interviews were 
obtained at the airport station, according to a 
random sampling strategy; such sample size 
guaranteed the desired level of accuracy on the 
estimated probabilities. Table 3 provides informa- 
tion about the first section of the interview. The 
frequency distributions of the judgments are


Table 1. Items and verbal scales in CSS

What is your general judgement on the airport?

                    Excellent  Good  Fairly good  Barely satisfactory  Poor  Very poor
                    (6)        (5)   (4)          (3)                  (2)   (1)
Overall evaluation

What is your opinion on:

                    Excellent  Good  Fairly good  Barely satisfactory  Poor  Very poor
                    (6)        (5)   (4)          (3)                  (2)   (1)
F1
F2
F3
F4
F5
very similar between the airport dimensions. The
most representative class is Good, about 60% of 
the sample for the overall evaluation and for the 
other attributes, except for F5 where it is 36%. If 
we just assign to the six categories the values from 
1 to 6, we will see, on average, that F3 is the best 
evaluated attribute while F5 is the worst. 

Before getting into the econometric analysis,
SP data need to be correctly organized. None of
the considered attributes has a continuous
scale, and more than two levels are specified, so we need
to use an effect coding scheme. This creates (Z-1)
variables that can take the values 1, 0, -1, where
Z is the number of levels. We decided to exclude
the lowest category. For example, for F1 we have
the situation in Table 4.
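
A minimal Python sketch of effect coding follows. It assumes the usual convention that the excluded (lowest) category is coded -1 on every variable, so that its implied effect equals minus the sum of the estimated coefficients; the level labels and function names are hypothetical, and the F1 coefficients are the ones reported in Table 5.

```python
LEVELS = ["very_poor", "poor", "barely_satisfactory", "fairly_good", "good", "excellent"]


def effect_code(level, levels=LEVELS):
    """Return the (Z-1) effect-coded variables for one judgement."""
    if level not in levels:
        raise ValueError(f"unknown level: {level}")
    base, coded = levels[0], levels[1:]
    if level == base:
        return [-1] * len(coded)  # excluded category: -1 on every variable
    return [1 if level == name else 0 for name in coded]


def base_level_effect(coefficients):
    """Implied effect of the excluded category: minus the sum of the others."""
    return -sum(coefficients)


# Coefficients of F1_p, F1_bs, F1_fg, F1_g, F1_e as reported in Table 5.
f1_coefficients = [-0.5214, -0.2341, 0.2332, 0.5543, 0.7119]
f1_very_poor = base_level_effect(f1_coefficients)

print(effect_code("very_poor"))
print(effect_code("good"))
print(round(f1_very_poor, 4))
```

The implied effect of Very poor for F1 obtained this way agrees, up to rounding, with the weight reported for that level in Table 6.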


Table 2. Example of a choice set used in the study

If you were to have these alternatives available to you, which one would
you choose?

Factors  Airport A judgement   Airport B judgement
F1       Good                  Fairly good
F2       Barely satisfactory   Very poor
F3       Very poor             Good
F4       Poor                  Excellent
F5       Barely satisfactory   Fairly good
         □                     □


Now, we turn our attention to the issue of parameter
estimation. We may obtain information
about the relative importance of the attribute levels
by using discrete choice models. In particular,
since we have only two alternatives per choice
set, binary logit is used. The estimation results
are reported in Table 5. The preferred model is the
one in which the F3 attribute is recoded as having
two categories instead of six, named “High”
(Fairly good, Good, or Excellent) and “Low”
(Very poor, Poor, Barely satisfactory). Therefore, a
new single dummy variable is created (F3_h) that
takes value 1 when the judgment on F3 is “High”
and 0 otherwise. In this final model, we included
all the variables that have significant parameters.


Table 3. Frequencies of the judgements on the airport dimensions

                    Very poor  Poor   Barely satisfactory  Fairly good  Good    Excellent  Mean
Overall evaluation  0.3%       1.2%   3.0%                 21.3%        66.7%   7.5%       4.75
F1                             1.8%                                     59.0%   7.3%       4.63
F2                  0.3%       1.8%   4.8%                 24.6%        60.2%   8.3%       4.67
F3                  0.6%       1.7%   3.5%                 22.1%        56.3%   15.9%      4.80
F4                  0.4%       2.1%   4.3%                 22.4%        60.1%   10.7%      4.72
F5                  4.2%       6.2%   12.1%                33.7%        36.1%   7.7%       4.15
Table 4. Effect coding for factor F1

                     VARIABLES
LEVELS               F1_p  F1_bs  F1_fg  F1_g  F1_e
Very poor            -1    -1     -1     -1    -1
Poor                 1     0      0      0     0
Barely satisfactory  0     1      0      0     0
Fairly good          0     0      1      0     0
Good                 0     0      0      1     0
Excellent            0     0      0      0     1


Table 5. Estimation results of binary logit model

Discrete choice (binary logit) model
Maximum Likelihood Estimates
Log likelihood function = -964.3959
Pseudo-R2 = 0.48923

Variable  Coefficient  Std. Err.  b/St.Er.  P[|Z|>z]
F1_p      -0.5214      0.0784     -6.6530   0.0000
F1_bs     -0.2341      0.0718     -3.2600   0.0011
F1_fg      0.2332      0.0777      3.0010   0.0027
F1_g       0.5543      0.0762      7.2750   0.0000
F1_e       0.7119      0.0804      8.8530   0.0000
F2_p      -0.3031      0.0722     -4.1960   0.0000
F2_g       0.2994      0.0712      4.2040   0.0000
F2_e       0.5167      0.0696      7.4190   0.0000
F3_h       0.2200      0.0676      3.2540   0.0011
F4_p      -0.5079      0.0740     -6.8650   0.0000
F4_bs     -0.1945      0.0771     -2.5220   0.0117
F4_fg      0.1885      0.0731      2.5780   0.0099
F4_g       0.3607      0.0792      4.5540   0.0000
F4_e       0.6781      0.0837      8.1000   0.0000
F5_p      -0.5678      0.0780     -7.2800   0.0000
F5_fg      0.2202      0.0689      3.1940   0.0014
F5_g       0.5194      0.0748      6.9480   0.0000
F5_e       0.7129      0.0740      9.6310   0.0000


The overall explanatory power of this nonlinear
model is very good; in fact, a pseudo-R2 of 0.5
is equivalent to about 0.8-0.9 for a linear model
(Domencich & McFadden, 1975). In the last two
columns, the Wald test is reported.
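
The last two columns of Table 5 can be checked by hand. The sketch below assumes the usual form of the Wald statistic for a single coefficient, z = b / (standard error), with a two-sided p-value taken from the standard normal distribution; the function names are hypothetical, and the input values are those reported for F4_bs.

```python
import math


def wald_z(coefficient, std_err):
    """Wald statistic for H0: coefficient = 0."""
    return coefficient / std_err


def two_sided_p(z):
    """Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2))


# Values reported in Table 5 for the variable F4_bs.
z_f4_bs = wald_z(-0.1945, 0.0771)
p_f4_bs = two_sided_p(z_f4_bs)
print(round(z_f4_bs, 3), round(p_f4_bs, 4))
```

Small differences from the tabulated statistic are expected, because the published coefficient and standard error are themselves rounded to four decimals.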

The relative importance weights of the attribute
levels are summarized in Table 6. The magnitude
and the signs are consistent with our a priori expectations.
In fact, for each attribute, the effect on utility
increases, moving from the lowest category Very
poor to the highest category Excellent. However, it
should be noticed that this growth is not linear.

Based on the information presented in Table 
6 and those gathered from the first section of the 
Table 6. Relative importance weights of the attribute levels

            IMPORTANCE WEIGHTS
ATTRIBUTES  Very poor  Poor     Barely   Fairly   Good    Excellent
F1          -0.7438    -0.5214  -0.2341  0.2332   0.5543  0.7119
F2          -0.5130    -0.3031   0.0000  0.0000   0.2994  0.5167
F3           0.0000     0.0000   0.0000  0.2200   0.2200  0.2200
F4          -0.5249    -0.5079  -0.1945  0.1885   0.3607  0.6781
F5          -0.8847    -0.5678   0.0000  0.2202   0.5194  0.7129


interview, SQI can be computed through the fol- 
lowing formula: 

$$SQI_q = \sum_{k=1}^{5} \sum_{l=1}^{6} \beta_{kl} \, x_{klq} \qquad (10)$$

where $\beta_{kl}$ is the parameter of the SP model corresponding
to the l-th level of the k-th factor,
and $x_{klq}$ has value 1 if the judgement expressed
by user q for factor k is at level l, and 0 otherwise. Therefore, the
SQI for user q is obtained by simply adding up 
the importance weights of the attribute levels 
relevant to the judgment expressed by user q. 
Then the overall SQI is measured by taking the 
individual SQI average for the sampled users. 
Table 7 shows the overall SQI and the contribu- 
tions of each attribute. 

The greatest contribution to the actual SQI
is given by the F1 attribute, while the smallest one is
given by the F3 attribute. In order to obtain a relative
measure of SQI, we normalized the index
in this way:

$$0 \leq SQI'_q = \frac{SQI_q - SQI_{\min}}{SQI_{\max} - SQI_{\min}} \leq 1 \qquad (11)$$
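
The computation of the individual SQI and of its normalized version can be sketched as follows. The weights are those of Table 6; the judgements are coded 1 (Very poor) to 6 (Excellent), and the example respondent and function names are hypothetical.

```python
# Importance weights from Table 6, ordered from Very poor (1) to Excellent (6).
WEIGHTS = {
    "F1": [-0.7438, -0.5214, -0.2341, 0.2332, 0.5543, 0.7119],
    "F2": [-0.5130, -0.3031, 0.0000, 0.0000, 0.2994, 0.5167],
    "F3": [0.0000, 0.0000, 0.0000, 0.2200, 0.2200, 0.2200],
    "F4": [-0.5249, -0.5079, -0.1945, 0.1885, 0.3607, 0.6781],
    "F5": [-0.8847, -0.5678, 0.0000, 0.2202, 0.5194, 0.7129],
}

# Smallest and largest attainable index values.
SQI_MIN = sum(min(w) for w in WEIGHTS.values())
SQI_MAX = sum(max(w) for w in WEIGHTS.values())


def sqi(judgements):
    """Add up the weights of the levels chosen by one respondent."""
    return sum(WEIGHTS[factor][level - 1] for factor, level in judgements.items())


def normalized_sqi(judgements):
    """Rescale the individual SQI to the interval [0, 1]."""
    return (sqi(judgements) - SQI_MIN) / (SQI_MAX - SQI_MIN)


# Hypothetical respondent: Good on F1-F4, Fairly good on F5.
respondent = {"F1": 5, "F2": 5, "F3": 5, "F4": 5, "F5": 4}
print(round(SQI_MIN, 3), round(SQI_MAX, 3))
print(round(sqi(respondent), 4), round(normalized_sqi(respondent), 3))
```

The extreme values obtained from the weights match the minimum and maximum of the SQI reported in Table 7, which is a useful consistency check on the reconstruction.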


EXPLANATORY MODELS WITH 
DECISION TREES 

In this section, we consider the construction of explanatory
models for the satisfaction indices based
on decision trees. Decision trees are a widely used
technique to extract knowledge from data. They
are based on an iterative and hierarchic partition
of the training set in subsets of decreasing entropy,
where the entropy is computed on the frequency
distribution of the nominal variable that is to be
classified. Given the tree-shaped hierarchic nature
of the subset identification, the final subsets are
called leaves. The variables used to split each
subset into its children are then used to build, in a
leaf-to-root path, the rule that identifies the subset
of the training set that represents that leaf. The
decision trees thus represent a particular type of
rule-based classification system that partitions the
training data, and associates with each element of
the partition a single class of the variable that is
to be classified (such a variable is often referred to, in
data-mining jargon, as the target variable).
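
The entropy criterion that drives the partition can be made concrete with a small Python sketch. It computes the entropy of the target variable and the reduction in entropy (information gain) obtained by splitting on one attribute. The records and class labels are hypothetical, and C4.5-style algorithms additionally normalize the gain (gain ratio), which is omitted here for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class frequency distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the parent set minus the weighted entropy of its children."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical records using two of the explanatory variables, with a binary target.
rows = [
    {"FLYER": "Heavy", "SEX": "M"},
    {"FLYER": "Heavy", "SEX": "F"},
    {"FLYER": "Light", "SEX": "M"},
    {"FLYER": "Light", "SEX": "F"},
]
labels = ["high", "high", "low", "low"]

gain_flyer = information_gain(rows, labels, "FLYER")
gain_sex = information_gain(rows, labels, "SEX")
print(gain_flyer, gain_sex)
```

In this toy example the first attribute separates the two classes perfectly and the second carries no information, so a tree builder would split on the first one.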

Extensive variants of such techniques have 
been proposed and refined since the seminal work 


Table 7. Service quality index and its attributes contributions

                    Contributions
          SQI       F1       F2       F3      F4       F5
Mean      1.390     0.411    0.213    0.207   0.312    0.247
Minimum   -2.667    -0.744   -0.513   0       -0.525   -0.885
Maximum   2.840     0.712    0.517    0.22    0.678    0.713
of Breiman et al. (Breiman, Friedman, Olshen,
& Stone, 1984) on classification and regression
trees. For a detailed description of this method,
we refer the interested reader to the large body
of available literature.

Here we adopt a very flexible and user-friendly
data mining tool that implements several variants
of decision trees, the open source software WEKA
(see Witten & Frank, 2005). WEKA is a large
software project developed at the University of
Waikato, New Zealand, that puts together a large
collection of classification and regression tools
(among others, neural networks, support vector
machines, logistic regressions, association rules)
in a common experimental environment where the
user can edit and preprocess the data files.

The choice of rule-based models expressed as 
a decision tree is driven by the objective of under- 
standing and interpreting the relations between 
the satisfaction and other characteristics expressed 
by the customers in the survey. In particular, we 
adopt the J48 algorithm, a recent implementation
of Quinlan’s classical C4.5 (Quinlan, 1993).
J48 allows one to control the size of the
tree, thus avoiding potential overfitting of the
data, by two alternative parameters: a pruning
process, controlled by a confidence factor, and a
lower bound on the minimum number of elements
that can be associated with a leaf. While the latter
parameter is very straightforward and has no rela- 
tion with the characteristics of the data analyzed, 
the former is based on probabilistic models as- 
sociated with each leaf’s data to reduce the size 
of the tree without losing predictive power, and 
may exhibit a more consistent behavior. In the 
following experiments, we tested different levels 
of confidence, maintaining a very small value 
of the lower bound on the minimum number of 
elements per leaf. 

In the previous sections, we have examined the process to build a consistent SQI using SP, and how
this can provide additional information about the customer’s preference structure. When the
SP-based SQI is computed for a larger number of interviews (in this work we present partial
results obtained on a subset of approx. 5,400 interviews), we can then compare, for each
interview, the service quality provided directly by the customer in the interview (referred to as
SAT index, for overall satisfaction) and the SQI computed from the judgment expressed by the same
customer on the five factors that have been considered (namely, F1, F2, F3, F4, and F5 introduced
in the previous section). The two indices (the SQI satisfaction index and the SAT satisfaction
index) have a low degree of linear correlation; not surprisingly, the correlation between the SAT
index and the satisfaction indices associated with the service components used to compute the SQI
index also appears to be low.

Table 8. Description of Variables for Decision Trees

Variable Name   Description
MON             month of year
TIME            time slot of flight
SEX             sex of passenger
AGE             age of passenger (classes)
OCC             occupation of passenger
NAT             nationality of passenger
FLYER           flying frequency (Heavy, Light)
USER            using airport frequency (Heavy, Light)
TERM            Terminal
FLIGHT          type of flight (national, international)
REAS            reason for travel
SAT             overall satisfaction declared by passenger
SQI             satisfaction computed by Stated Preference Models

Here we construct decision trees, where the target variables are, in turn, SAT and SQI, while the
explanatory variables are chosen among the set of variables measured by the survey. In Table 8, we
report the complete list of the variables that were submitted to the classification algorithm.

As anticipated, the SAT index is determined on a scale from 1 to 6 (1 = minimum, 6 = maximum). In
order to compare it with SQI, we rescaled both measures from 0 to 1, and then created for both
indices a dichotomous variable with a threshold value of 0.8, obtaining a binary version of the
two variables (referred to as SAT_b and SQI_b, respectively). The values of the two binary
variables are indicated with GOOD (passengers with satisfaction of at least 80%) and BAD
(passengers with satisfaction below 80%). The main interest of the study is in the analysis of the
extreme values, as one wants to understand which elements separate the highly satisfied passengers
from the rest. In particular, here we want to check whether the classification model is able to
explain why the two indices are different and for what type of customers.
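
The rescaling and thresholding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the rescaling is assumed to be linear, and only the SAT scale (1 to 6) is shown, since the bounds used to rescale SQI are not reported here.

```python
def rescale(score, low=1, high=6):
    """Map a score from the [low, high] scale linearly onto [0, 1] (assumed linear)."""
    return (score - low) / (high - low)


def binarize(score01, threshold=0.8):
    """GOOD for satisfaction of at least 80%, BAD otherwise."""
    return "GOOD" if score01 >= threshold else "BAD"


# Illustrative SAT judgments on the 1-to-6 scale.
sat_scores = [1, 4, 5, 6]
labels = [binarize(rescale(s)) for s in sat_scores]
print(labels)
```

With a linear rescaling, a judgment of 5 maps exactly to 0.8, so under this assumption only the judgments 5 and 6 fall in the GOOD class.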

The first experiments reported concern the J48 decision tree, where the target variable is the
binarized SAT_b index. In Table 9, we see the performance of the algorithm with different values
of the confidence factor. The table reports the overall correct recognition rate, the true
positive and true negative rates, the number of leaves in the corresponding tree, and the time
spent in the training phase. The data set was composed of 5,422 valid records, and straightforward
10-fold cross validation was used to produce the results described.

Table 9. Performance of Decision Trees J48 for target variable SAT_b

Confidence  Lower Bound  Percent  True Positive  True Negative  Number of  Training
Level       on Leaves    Correct  Rate           Rate           Leaves     Time (secs)
0.5         2            95.37%   94.30%         96.10%         71         0.09
0.4         2            95.51%   94.20%         96.40%         42         0.09
0.3         2            95.61%   93.80%         96.80%         42         0.08
0.25        2            95.70%   93.80%         97.00%         15         0.08
0.2         2            95.90%   93.50%         97.50%         15         0.09
0.1         2            95.90%   93.50%         97.50%         15         0.09

Figure 1. Decision Tree for SAT_b
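
To make the quantities reported in Table 9 concrete, the following Python sketch computes the overall recognition rate and the true positive and true negative rates from a list of known and predicted labels. It assumes that GOOD is treated as the positive class, which is our assumption and is not stated in the text; the function name and the toy labels are illustrative only.

```python
def classification_rates(true_labels, predicted_labels, positive="GOOD"):
    """Return percent correct, true positive rate and true negative rate (in %).

    Assumes both classes occur at least once among the true labels.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    total = tp + tn + fp + fn
    return {
        "percent_correct": 100.0 * (tp + tn) / total,
        "true_positive_rate": 100.0 * tp / (tp + fn),
        "true_negative_rate": 100.0 * tn / (tn + fp),
    }


# Toy labels, not taken from the survey data.
truth = ["GOOD", "GOOD", "GOOD", "BAD", "BAD"]
pred = ["GOOD", "GOOD", "BAD", "BAD", "GOOD"]
rates = classification_rates(truth, pred)
print(rates)
```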

The performance of the model is of very good quality, and we see a high correct recognition rate
(above 95%) without relevant difference between the true positive and the true negative rates. We
then analyze the structure of the tree that obtains the best recognition rate, which, in this
case, corresponds to that with the lowest number of leaves. The tree is obtained with a confidence
value of 0.20 or lower, and is depicted in Figure 1; attached to each leaf is the number of
elements of the complete training set that are associated with that leaf (the first figure refers
to the elements in the class that characterizes the leaf, the second figure to those of the other
class; some leaves still exhibit a nonnegligible degree of entropy, due to the effect of pruning).
At first glance we note that the relevant variables for the classification are the month of the
interview, the nationality of the passenger (national or international), the flying frequency, and
the terminal. No significant information appears to be carried by the age, the sex, the
occupation, and the travel reason of the airport’s customers. It is interesting to note that there
is a strong relation between the target variable and the month of the interview; months from
February to May present only GOOD level of satisfaction, while September and October are composed
only of BAD records. On the other hand, we note that for the months of July and August, the
national passengers are dissatisfied (even if with some degree of imprecision, as expressed by the
frequency attached to the related leaves). For international passengers during the month of
August, a more articulated behavior emerges: those who do not fly very often from the analyzed
airport declare satisfaction (i.e., GOOD value of the target variable), while those who use the
airport often are significantly dissatisfied; in particular, those leaving from and arriving at
one of the three terminals (A) and those that use terminal C and fly frequently (although the
small number of records in these leaves does not lend them strong significance).

The same analysis is repeated, substituting the target variable SAT_b with SQI_b, maintaining the
same set of explanatory variables. In Table 10, we report the results of the J48 algorithm for
different values of the confidence factor. In this case, the separation problem appears to be
slightly more difficult, as for the same level of confidence, the algorithm requires a larger
number of leaves for convergence. Training time and recognition percentage confirm this; moreover,
the correct recognition rates are lower than those obtained in the previous model, although still
balanced between true positive and true negative. Nevertheless, the best performance, obtained for
a small confidence level (0.1), is above 82%, and thus the information provided by the model can
definitely be considered interesting.

Table 10. Performance of Decision Trees J48 for target variable SQI_b

Confidence  Lower Bound  Percent  True Positive  True Negative  Number of  Training
Level       on Leaves    Correct  Rate           Rate           Leaves     Time (secs)
0.5         2            81.92%   80.70%         83.10%         325        0.44
0.4         2            82.12%   80.80%         83.40%         218        0.2
0.3         2            80.80%   81.20%         83.40%         140        0.24
0.25        2            82.57%   81.60%         83.50%         100        0.22
0.2         2            82.55%   82.40%         82.70%         26         0.2
0.1         2            82.64%   83.70%         81.60%         19         0.55

The analysis of the tree (Figure 2) highlights several aspects of interest for the comparison of
the two indices SAT and SQI in their binarized versions. At first glance we see that the important
role of month is maintained in this second model; this variable is still selected by the algorithm
to perform the first, and most relevant, split in the training data. But here we see that the
month of April turns out to be strongly characterized by dissatisfied clients (BAD value of the
SQI_b target variable), unlike what happened with SAT_b; at the same time, the month of May gets a
split on the time of the day variable, where one branch, associated with the morning time slot,
sees dissatisfied customers in terminals A and C and satisfied customers in terminal B. The other
branch, associated with afternoon and nighttime slots, shows an unusual pattern of dissatisfied
international customers and satisfied national ones. As in the SAT_b model, in the month of July
the national passengers are dissatisfied. The rest of the tree is substantially equivalent to that
derived for the SAT_b model; the month of August is split in the same nodes, as well as September
and October.

The comparison of the two trees highlights, with a certain precision, the few structural
differences between the two indices. The main conclusion is that the differences are restricted to
the months of April and May. In these 2 months it appears that the direct evaluation of the
satisfaction given by the customers is somewhat optimistic with respect to the more refined index
obtained by the SP method. Such results could be properly interpreted by the users of the survey;
one possible interpretation may be related to the way the questionnaires were submitted in those
periods, or to the particular type of traffic present at the time. Of more interest is the
coherent structure of both trees for the months of July and August, where the airport traffic is
typically characterized by a larger percentage of leisure travelers and international traffic. In
both months, the dissatisfaction expressed by national travelers emerges with strong evidence.
Analogously, we record the negative characterization of September and October, although here we
see that while the direct index (SAT_b) reports that no customers are in the high satisfaction
class, the second index (SQI_b) presents a large proportion of satisfied customers in the month of
September (282 compared to 896).

Figure 2. Decision Tree for SQI_b

A LOGIC MODEL FOR THE 
RELATION OF SPECIFIC AND 
SYNTHETIC INDICES 

As already stated in the previous sections, one of the main concerns in the measurement of
customer satisfaction is how to express the overall satisfaction stated by customers in terms of
the satisfaction declared about specific aspects of the service. In the application described
here, we have at hand an overall satisfaction index expressed by customers and the satisfaction
index expressed, by the same customers, on five elements that characterize the specific service
(here coded as F1, F2, F3, F4, F5). The deployment of an SP-based model has indicated a set of
weights to attribute to each one of these elements in order to derive a more reliable index of the
general satisfaction on the service. Such an index can be used to assess the effect on the overall
satisfaction of specific actions on the different elements. Here we intend to analyze the
relations between the direct satisfaction index (referred to as SAT) and the derived index (SQI)
from another angle. We intend to find an explanatory model that links the value of the overall
index to the values of the five elements. Obviously, such a model must exist for the SQI index,
which is computed by a linear combination of the values attributed to the five elements; on the
other hand, we know that the linear correlation between the SAT index and the level of
satisfaction expressed for the five elements is moderately poor (0.609). We restrict our attention
to the binarized version of the two indices, defining two classes (below and above 80% of the
maximum value of satisfaction), as in the previous section. Such simplification may reduce the
intrinsic variance of the indices; on the other hand, it forces the focus of the analysis on a
marked difference in the judgments expressed by the customers.

For this purpose we use the logic data miner Lsquare. Lsquare is a classification method based on
a logic programming formulation of the classification problem, where the problem of finding a set
of logic rules that separates the training data is made equivalent to the solution of a finite
sequence of minimum cost satisfiability problems (MIN SAT). A description of the system is
available in Felici and Truemper (2002, 2005) and Felici, Sun, and Truemper (2006). A more
detailed discussion of the methods and mathematical tools on which the system is based can be
found in Truemper (2004).

For the derivation of the logic model we use, in turn, the SAT_b and the SQI_b indices as target
variables, and as explanatory variables, the satisfaction level expressed by the customers, on a
scale from 1 to 6, on the five factors. The values of the explanatory variables are automatically
transformed by the software into one or more logic variables by the identification of cutpoints
that best express the separation capability of these variables with respect to the two values of
the target variable.


IF
    F1=BAD  & F5=BAD
 OR F1=GOOD & F3=BAD  & F5=BAD
 OR F1=GOOD & F3=GOOD & F3=BAD & F4=BAD & F5=BAD
 OR F1=BAD  & F4=BAD  & F5=BAD
 OR F2=BAD  & F3=BAD  & F4=GOOD & F5=GOOD
 OR F1=GOOD & F3=BAD  & F l=GOOD & F5=GOOD
THEN satisfaction = BAD

The first set of results that we examine concerns the model where SAT_b plays the role of the
target variable. We split the available data into a training set composed of 700 records, 400
from those having GOOD satisfaction levels and 300 from those having BAD (the training set is
extracted at random from the 5,422 available records according to the uniform distribution). The
remaining data is used for testing purposes.

The system identifies a single cutpoint for each variable, located at the value 4.5; each
satisfaction judgment, from 1 to 6, is then mapped into a logic variable that has value true if
the value is equal to 5 or 6, and false otherwise. Interestingly, the binary mapping computed by
Lsquare strongly resembles the partition of the overall satisfaction index adopted ex-ante for the
target variable. For symmetry, we will indicate with GOOD and BAD also the binary value of the
five explanatory variables; for example, with F3 = GOOD we intend that the satisfaction level
expressed on this factor is greater than or equal to 5.
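
As a rough illustration of how such a rule is applied, the Python sketch below maps the five judgments to GOOD/BAD with the cutpoint 4.5 and evaluates a formula written as a disjunction of clauses. It is not the Lsquare implementation: the data layout, the function names, and the two-clause formula are our own choices for the example and do not reproduce the complete formula found by the system.

```python
CUTPOINT = 4.5  # cutpoint reported in the text for every factor


def to_logic(judgments, cutpoint=CUTPOINT):
    """Map each 1-to-6 judgment to GOOD (above the cutpoint) or BAD."""
    return {factor: ("GOOD" if value > cutpoint else "BAD")
            for factor, value in judgments.items()}


def dnf_holds(formula, record):
    """A formula is a list of clauses; a clause maps factors to required values.

    The formula holds if at least one clause is satisfied in full.
    """
    return any(all(record[factor] == wanted for factor, wanted in clause.items())
               for clause in formula)


# Illustrative two-clause formula whose value true means BAD satisfaction.
example_formula = [
    {"F1": "BAD", "F5": "BAD"},
    {"F1": "GOOD", "F3": "BAD", "F5": "BAD"},
]

# Hypothetical customer judgments on the five factors.
customer = to_logic({"F1": 5, "F2": 6, "F3": 3, "F4": 4, "F5": 2})
predicted = "BAD" if dnf_holds(example_formula, customer) else "GOOD"
print(customer, predicted)
```

Writing each clause as a small dictionary keeps the rule readable and close to the IF ... THEN form in which the formulas are reported.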

The best formula obtained is composed of six logic clauses, has value true for BAD records, and
has an overall precision of 75.3%, divided into 66.1% on GOOD records and 89.3% on BAD records
(such percentages are computed on the 4,722 records kept aside for testing). Thus, the formula
holds true for 89.3% of the records with BAD overall satisfaction in the test set, and holds
false for 66.1% of records with GOOD overall satisfaction in the test set. Next, we report a
description of the formula. With a deeper analysis of the formula produced by Lsquare, we can
identify a much more compact formula that still exhibits a good behavior. This very simple,
one-clause formula holds true for 58% of records with GOOD and for 93.7% of BAD records out of
the 4,722 records used for testing, resulting in an overall correctness of 72.1%:

The results just described highlight the difficulty of predicting, in a consistent way and with
high precision, the general satisfaction expressed by the customers from the satisfaction on the
five specific aspects of the service. They also provide some additional information: while it is
somewhat difficult to predict the GOOD satisfaction level, it is much easier to predict the BAD.
In other words, if a customer satisfies this rule, we can assume that he/she has a GOOD
satisfaction level with mild confidence; but, on the other hand, a customer that does not satisfy
it can very likely be assigned a BAD satisfaction level.

The same analysis is conducted, adopting SQI_b as the target variable. The entire set is split
into training and test data in exactly the same way as for the previous model. The results are as
follows: The best rule identified is composed of seven clauses and exhibits an overall correct
classification rate of 88.2%; the next formula reported holds true for 80.2% of records with GOOD
and false for 95.9% of BAD records out of the 4,722 records used for testing.

As done in the previous case, we can select the best one-clause formula from the ones available.
Such formula has exactly the same form as the one-clause formula derived in the SAT_b model: the
general satisfaction is GOOD when the satisfaction levels on F1, F2, and F4 are, simultaneously,
GOOD. The formula has an overall precision on the test set of 83.8%, obtained from a correct
recognition of 70.8% on GOOD records and of 96.5% on BAD records.

IF
    F1=GOOD & F3=GOOD & F4=GOOD
THEN satisfaction = GOOD


IF
    F1=BAD & F4=BAD
 OR F2=BAD & F5=BAD
 OR F1=BAD & F5=BAD
 OR F4=BAD & F5=BAD
 OR F1=BAD & F3=BAD
 OR F2=BAD & F3=BAD
 OR F1=BAD & [illegible] & F3=GOOD
THEN satisfaction = BAD

From these results we can again conclude that the system of preferences that supports the
customers’ satisfaction does not exhibit a clear structure; even for the second index (that, we
recall, is obtained by a linear combination of the five satisfaction levels expressed on the
detailed services) the best logic-based model fails to identify, with high precision, the records
with GOOD general satisfaction; on the other hand, we have identified very simple logic rules that
can be considered extremely reliable when declaring that a record is in the BAD class. Besides, we
see that the most compact rule is the same for both models.

CONCLUSION 

In this chapter we have considered the problem of service quality measurement using an SP model
to assess the contribution of different factors to the overall value of service quality. To
analyze the results obtained, we have used two rule-based classification techniques, and have
exploited some characteristics of the results obtained, comparing the differential information
provided by the SP model and by a direct customer survey. Data from a survey on a large Italian
airport were used. The analysis shows that the apparent difference between the two measures can
be restricted to some particular subsets of the available data and thus, the two indices compared
can be better understood and put in relation with specific events and customer types. The use of
rule-based data mining techniques brings out clearly the information contained in the data, and
helps orient the decision makers in the interpretation of the results.

ABSTRACT

This chapter discusses the use of support vector machines (SVM) for business applications. It provides 
a brief historical background on inductive learning and pattern recognition, and then an intuitive
motivation for SVM methods. The method is compared to other approaches, and the tools and background
theory required to successfully apply SVM to business applications are introduced. The authors hope 
that the chapter will help practitioners to understand when the SVM should be the method of choice, as 
well as how to achieve good results in minimal time. 


INTRODUCTION 

Recent years have seen an explosive growth in computing power and data storage within business
organisations. From a business perspective, this means that most companies now have massive
archives of customer and product data, and more often than not these archives are far too large
for human analysis. An obvious question has therefore arisen, “How can one turn these immense
corporate data archives to commercial advantage?” To this end, a number of common applications
have arisen, from predicting which products a customer is most likely to purchase, to designing
the perfect product based on responses to questionnaires. The theory and development of these
processes has grown into a discipline of its own, known as Data Mining, which draws heavily on
the related fields of Machine Learning, Pattern Recognition, and Mathematical Statistics.

The Data Mining discipline is still developing, however, and a great deal of suboptimal and ad
hoc analysis is being done. This is partly due to the complexity of the problems, but is also due
to the vast number of available techniques. Even the most fundamental task in Data Mining, that
of inductive inference, or making predictions based on examples, can be tackled by a great many
different techniques. Some of these techniques are very difficult to tailor to a specific problem
and require highly skilled human design. Others are more generic in application and can be
treated more like the proverbial “black box.” One particularly generic and powerful method, known
as the Support Vector Machine (SVM), has proven to be both easy to apply and capable of producing
results that range from good to excellent in comparison to other methods. While application of
the method is relatively straightforward, the practitioner can still benefit greatly from a basic
understanding of the underlying machinery.

Unfortunately, most available tutorials on SVMs require a very solid mathematical background, so
we have written this chapter to make SVM accessible to a wider community. This chapter comprises
a basic background on the problem of induction, followed by the main sections. In the first
section we introduce the concepts and equations on which the SVM is based, in an intuitive
manner, and identify the relationship between the SVM and some of the other popular analysis
methods. In the second section we survey some interesting applications of SVMs on practical
real-world problems. Finally, the third section provides a set of guidelines and rules of thumb
for applying the tool, with a pedagogical example that is designed to demonstrate everything that
the SVM newcomer requires in order to immediately apply the tool to a specific problem domain.
The chapter is intended as a brief introduction to the field that presents the ideas and
methodologies, together with a hands-on introduction to freely available software, allowing the
reader to rapidly determine the effectiveness of SVMs for their specific domain.

BACKGROUND 

SVMs are most commonly applied to the problem 
of inductive inference, or making predictions 
based on previously seen examples. To illustrate 
what is meant by this, let us consider the data 
presented in Tables 1 and 2. We see here an 
example of the problem of inductive inference, 
more specifically that of supervised learning. In 
supervised learning we are given a set of input 
data along with their corresponding labels. The 
input data comprises a number of examples about 
which several attributes are known (in this case, 
age, income, etc.). The label indicates which class 
a particular example belongs to. In the example 
above, the label tells us whether or not a given 
person has a broadband Internet connection to 
their home. This is called a binary classification 
problem because there are only two possible 
classes. In the second table, we are given the 
attributes for a different set of consumers, for 
whom the true class labels are unknown. Our 
goal is to infer from the first table the most likely 
labels for the people in the second table, that is, 
whether or not they have a broadband Internet 
connection to their home. 

In the field of data mining, we often refer to these sets by the terms test set, training set,
validation set, and so on, but there is some confusion in the literature about the exact
definitions of these terms. For this reason we avoid this nomenclature, with the exception of the
term training set. For our purposes, the training set shall be all that is given to us in order
to infer some general correspondence between the input data and labels. We will refer to the set
of data for which we would like to predict the labels as the unlabelled set.

Table 1. Training or labelled set

Age   Income         Years of    Gender   Broadband Home
                     Education            Internet Connection?
30    $56,000 / yr   16          male     Yes
50    $60,000 / yr   12          female   Yes
16    $2,000 / yr    11          male     No
35    $30,000 / yr   12          male     No

The dataset in Table 1 contains demographic information for four randomly selected people. These people were surveyed to
determine whether or not they had a broadband home internet connection.


Table 2. Unlabelled set

Age   Income         Years of    Gender   Broadband Home
                     Education            Internet Connection?
40    $48,000 / yr   17          male     unknown
29    $60,000 / yr   18          female   unknown

The dataset in Table 2 contains demographic information for people who may or may not be good candidates for broadband
internet connection advertising. The question arising is, “Which of these people is likely to have broadband internet
connection at home?”


Figure 1. Inductive inference process in schematic 
form (Based on a particular training set of exam- 
ples with labels, the learning algorithm constructs 
a decision rule which can then be used to predict 
the labels of new unlabelled examples.) 



A schematic diagram for the above process is provided in Figure 1. In the case of the SVM
classifier (and most other learning algorithms for that matter), there are a number of parameters
which must be chosen by the user. These parameters control various aspects of the algorithm, and
in order to yield the best possible performance, it is necessary to make the right choices. The
process of choosing parameters that yield good performance is often referred to as model
selection. In order to understand this process, we have to consider what it is that we are aiming
for in terms of classifier performance. From the point of view of the practitioner, the hope is
that the algorithm will be able to make true predictions about unseen cases. Here the true values
we are trying to predict are the class labels of the unlabelled data. From this perspective it is
natural to measure the performance of a classifier by the probability of its misclassifying an
unseen example.

It is here that things become somewhat less 
straightforward, however, due to the following 
dilemma. In order to estimate the probability 
of a misclassification, we need to know the true 
underlying probability distributions of the data 
that we are dealing with. If we actually knew this, 
however, we wouldn’t have needed to perform 
inductive inference in the first place! Indeed 
knowledge of the true probability distributions 
allows us to calculate the theoretically best pos- 
sible decision rule corresponding to the so-called 
Bayesian classifier (Duda, Hart, & Stork, 2001). 




Support Vector Machines for Business Applications 


In recent years, a great deal of research effort 
has gone into developing sophisticated theories 
that make statements about the probability of a 
particular classifier making errors on new un- 
labelled cases — these statements are typically 
referred to as generalization bounds. It turns out, 
however, that the research has a long way to go, 
and in practice one is usually forced to determine 
the parameters of the learning algorithm by much 
more pragmatic means. Perhaps the most straight- 
forward of these methods involves estimating the 
probability of misclassification using a set of real 
data for which the class labels are known — to do 
this one simply compares the labels predicted by 
the learning algorithm to the true known labels. 
The estimate of misclassification probability is 
then given by the number of examples for which 
the algorithm made an error (that is, predicted 
a label other than the true known label) divided 
by the number of examples that were tested in 
this manner. 
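
The estimate just described amounts to a single ratio, and can be sketched in Python as follows. The function name and the toy labels are ours and serve only as an illustration.

```python
def misclassification_rate(true_labels, predicted_labels):
    """Fraction of tested examples whose predicted label differs from the known one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one tested example is required")
    errors = sum(1 for truth, pred in zip(true_labels, predicted_labels)
                 if truth != pred)
    return errors / len(true_labels)


# Toy example: one error among four tested examples.
known = ["Yes", "Yes", "No", "No"]
predicted = ["Yes", "No", "No", "No"]
rate = misclassification_rate(known, predicted)
print(rate)
```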

Some care needs to be taken, however, in how 
this procedure is conducted. A common pitfall 
for the inexperienced analyst involves making 
this estimate of misclassification probability 
using the training set from which the decision 
rule itself was inferred. The problem with this 
approach is easily seen from the following simple 
decision rule example. Imagine a decision rule 
that makes label predictions by way of the fol- 
lowing procedure (sometimes referred to as the 
notebook classifier): 

The notebook classifier decision rule: We wish to predict the label of the example X. If X is
present in the training set, make the prediction that its label is the same as the corresponding
label in the training set. Otherwise, toss a coin to determine the label.
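
A minimal Python sketch of this rule is given below, using the four labelled examples of Table 1 as the training set. The class and method names are our own, and the coin toss is simulated with a seeded pseudo-random generator so that the example is repeatable.

```python
import random


class NotebookClassifier:
    """Memorizes the training set; guesses at random on anything else."""

    def __init__(self, labels=("Yes", "No"), seed=0):
        self.labels = labels
        self.memory = {}
        self.rng = random.Random(seed)  # stands in for the coin toss

    def fit(self, examples, labels):
        self.memory = {tuple(x): y for x, y in zip(examples, labels)}
        return self

    def predict(self, example):
        key = tuple(example)
        if key in self.memory:
            return self.memory[key]          # seen before: repeat its label
        return self.rng.choice(self.labels)  # unseen: toss a coin


# Training set from Table 1: (age, income per year, years of education, gender).
train_x = [(30, 56000, 16, "male"), (50, 60000, 12, "female"),
           (16, 2000, 11, "male"), (35, 30000, 12, "male")]
train_y = ["Yes", "Yes", "No", "No"]

clf = NotebookClassifier().fit(train_x, train_y)
training_errors = sum(clf.predict(x) != y for x, y in zip(train_x, train_y))
unseen_prediction = clf.predict((40, 48000, 17, "male"))  # first row of Table 2
print(training_errors, unseen_prediction)
```

The number of errors on the training set is zero by construction, while the prediction for the unseen example carries no information at all.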

For this method, while the estimated probability of misclassification on the training set will be
zero, it is clear that for most real-world problems the algorithm will perform no better than
tossing a coin! The notebook classifier is a commonly used example to illustrate the phenomenon
of overfitting, which refers to situations where the decision rule fits the training set well,
but does not generalize well to previously unseen cases. What we are really aiming for is a
decision rule that generalizes as well as possible, even if this means that it cannot perform as
well on the training set.

Cross-validation: So it seems that we need a more 
sophisticated means of estimating the generaliza- 
tion performance of our inferred decision rules if 
we are to successfully guide the model selection 
process. Fortunately there is a more effective 
means of estimating the generalization perfor- 
mance based on the training set. This procedure, 
which is referred to as cross-validation or more 
specifically n-fold cross-validation, proceeds in 
the following manner (Duda et al., 2001): 

1. Split the training set into n equally sized and disjoint subsets (partitions), numbered
1 to n.

2. Construct a decision function using a conglomerate of all the data from subsets 2 to n.

3. Use this decision function to predict the labels of the examples in subset number 1.

4. Compare the predicted labels to the known labels in subset number 1.

5. Repeat steps 2 through 4 a further (n-1) times, each time testing on a different subset,
and always excluding that subset from training.
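
The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the subsets are taken as consecutive slices of the data (in practice the examples would usually be shuffled first), the number of examples is assumed to be divisible by n, and the learner is a hypothetical stand-in (a 1-nearest-neighbour rule on a single numeric attribute) rather than an SVM. The data values are invented for the example.

```python
def n_fold_cross_validation(examples, labels, n, train):
    """Estimate the misclassification probability by n-fold cross-validation.

    train(examples, labels) must return a function mapping an example to a label.
    """
    size = len(examples)
    if size % n != 0:
        raise ValueError("this sketch assumes equally sized subsets")
    fold = size // n
    errors = 0
    for k in range(n):
        start, stop = k * fold, (k + 1) * fold
        held_x, held_y = examples[start:stop], labels[start:stop]
        rest_x = examples[:start] + examples[stop:]   # all other subsets
        rest_y = labels[:start] + labels[stop:]
        decide = train(rest_x, rest_y)                # never sees the held-out subset
        errors += sum(1 for x, y in zip(held_x, held_y) if decide(x) != y)
    return errors / size


def train_nearest_neighbour(examples, labels):
    """Hypothetical stand-in learner: copy the label of the closest training value."""
    def decide(x):
        nearest = min(range(len(examples)), key=lambda i: abs(examples[i] - x))
        return labels[nearest]
    return decide


# Invented data: income in thousands per year and broadband label.
incomes = [2, 56, 30, 60, 10, 48]
has_broadband = ["No", "Yes", "No", "Yes", "No", "Yes"]
estimate = n_fold_cross_validation(incomes, has_broadband, n=3,
                                   train=train_nearest_neighbour)
print(estimate)
```

Passing the learner as a function keeps the cross-validation loop independent of the particular classification algorithm being assessed.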

Having done this, we can once again divide the number of misclassifications by the total number
of training examples to get an estimate of the true generalization performance. The point is that
since we have avoided checking the performance of the classifier on examples that the algorithm
had already “seen,” we have calculated a far more meaningful measure of classifier quality.
Commonly used values for n are 3 and 10, leading to so-called 3-fold and 10-fold
cross-validation.

Now, while it is nice to have some idea of 
how well our decision function will generalize, 
we really want to use this measure to guide the 
model selection process. If there are only, say, 
two parameters to choose for the classification 
algorithm, it is common to simply evaluate the 
generalization performance (using cross-valida- 
tion) for all combinations of the two parameters, 
over some reasonable range. As the number of 
parameters increases, however, this soon becomes 
infeasible due to the excessive number of param- 
eter combinations. Fortunately one can often 
get away with just two parameters for the SVM 
algorithm, making this relatively straightforward 
model selection methodology widely applicable 
and quite effective on real-world problems. 
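
A minimal sketch of this exhaustive search is given below. The function `cv_error` is a hypothetical callable, assumed to return the cross-validation error for one pair of parameter values; the helper `combinations_needed` simply shows why the approach stops being feasible as parameters are added.

```python
from itertools import product


def grid_search(cv_error, values_a, values_b):
    """Return (best_error, a, b) minimizing cv_error over all (a, b) pairs.

    `cv_error(a, b)` is a hypothetical callable returning the
    cross-validation error of a classifier trained with parameters a and b.
    """
    best = None
    for a, b in product(values_a, values_b):
        err = cv_error(a, b)
        if best is None or err < best[0]:
            best = (err, a, b)
    return best


def combinations_needed(values_per_parameter, n_parameters):
    """Size of the full grid: it grows exponentially with the parameter count."""
    return values_per_parameter ** n_parameters
```

With ten candidate values per parameter, two parameters require 100 cross-validation runs, whereas six parameters would already require a million.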

Now that we have a basic understanding of 
what supervised learning algorithms can do, as 
well as roughly how they should be used and 
evaluated, it is time to take a peek under the hood 
of one in particular, the SVM. While the main 
underlying idea of the SVM is quite intuitive, it 
will be necessary to delve into some mathemati- 
cal details in order to better appreciate why the 
method has been so successful. 


MAIN THRUST OF THE CHAPTER 

The SVM is a supervised learning algorithm that 
infers from a set of labeled examples a function 
that takes new examples as input, and produces 
predicted labels as output. As such the output of 
the algorithm is a mathematical function that is 
defined on the space from which our examples are 
taken. It takes on one of two values at all points in 
the space, corresponding to the two class labels 
that are considered in binary classification. One 
of the theoretically appealing things about the 
SVM is that the key underlying idea is in fact 
extremely simple. Indeed, the standard deriva- 
tion of the SVM algorithm begins with possibly 
the simplest class of decision functions: linear 
ones. To illustrate what is meant by this, Figure 
2 consists of three linear decision functions that 
happen to be correctly classifying some simple 
2D training sets. 

Figure 2. A simple 2D classification task: to separate the black dots 
from the circles. Three feasible but different linear decision functions 
are depicted, whereby the classifier predicts that any new samples in 
the gray region are black dots, and those in the white region are 
circles. Which is the best decision function and why? 

Linear decision functions consist of a decision 
boundary that is a hyperplane (a line in 2D, a plane 
in 3D, etc.) separating the two different regions 
of the space. Such a decision function can be 
expressed by a mathematical function of an input 
vector x, the value of which is the predicted label 
for x (either +1 or -1). The linear classifier can 
therefore be written as: 

$$g(x) = \operatorname{sign}(f(x)), \quad \text{where } f(x) = \langle w, x \rangle + b.$$

In this way we have parameterized the function 
by the weight vector w and the scalar b. The 
notation <w,x> denotes the inner or scalar product 
of w and x, defined by: 

$$\langle w, x \rangle = \sum_{i=1}^{d} w_i x_i$$

where d is the dimensionality, and w_i is the i-th 
component of w, where w is of the form 
(w_1, w_2, ..., w_d). Having formalized our decision 
function, we can now formalize the problem that 
the linear SVM addresses: 

Given a training set of vectors x_1, x_2, ..., x_n with 
corresponding class membership labels y_1, y_2, ..., 
y_n that take on the values +1 or -1, choose 
parameters w and b of the linear decision function that 
generalizes well to unseen examples. 
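
The linear decision function defined above can be written directly in code. The following is a small sketch only; the function names mirror the symbols f and g in the text, and mapping the boundary case f(x) = 0 to +1 is an arbitrary assumption made here.

```python
def inner(w, x):
    """Inner product <w, x> = sum over i of w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))


def f(w, b, x):
    """Real-valued output of the linear function <w, x> + b."""
    return inner(w, x) + b


def g(w, b, x):
    """Predicted label, +1 or -1 (f(x) == 0 is mapped to +1 by assumption)."""
    return 1 if f(w, b, x) >= 0 else -1
```

For example, with w = [1.0, -2.0] and b = 0.5, the point [3.0, 1.0] gives f = 1.5 and is labelled +1, while [0.0, 1.0] gives f = -1.5 and is labelled -1.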

Perceptron Algorithm: Probably the first algo- 
rithm to tackle this problem was the Perceptron 
algorithm (Rosenblatt, 1958). The Perceptron 
algorithm simply used an iterative procedure to 
incrementally adjust w and b until the decision 
boundary was able to separate the two classes 
of the training data. As such, the Perceptron 
algorithm would give no preference between the 
three feasible solutions in Figure 2 — any one of 
the three could result. This seems rather unsat- 
isfactory, as most people would agree that the 
rightmost decision function is the superior one. 
Moreover, this intuitive preference can be justified 
in various ways, for example by considering the 
effect of measurement noise on the data — small 
perturbations of the data could easily change the 
predicted labels of the training set in the first two 
examples, whereas the third is far more robust in 
this respect. In order to make use of this intuition, 
it is necessary to state more precisely why we 
prefer the third classifier: 

We prefer decision boundaries that not only 
correctly separate two classes in the training 
set, but lie as far from the training examples as 
possible. 
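
To make the contrast concrete, the sketch below implements a simple Perceptron-style update rule and a function that measures how far a hyperplane lies from the nearest training example. It is an illustrative sketch, not Rosenblatt's original formulation: the unit step size, the `max_epochs` limit and the function names are assumptions.

```python
import math


def perceptron(xs, ys, max_epochs=1000):
    """Adjust w and b until every training example is on the correct side."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(xs, ys):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # Misclassified (or on the boundary): nudge the hyperplane.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                updated = True
        if not updated:
            return w, b  # first separating hyperplane found, not the widest
    raise RuntimeError("no separating hyperplane found; data may not be separable")


def margin(w, b, xs):
    """Distance from the hyperplane <w,x> + b = 0 to the nearest example."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in xs) / norm
```

On a small separable data set, the hyperplane returned by `perceptron` separates the classes, but another separating hyperplane may well have a larger value of `margin`; the Perceptron has no reason to prefer it.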

This simple intuition is all that is required to 
lead to the linear SVM classifier, which chooses 
the hyperplane that separates the two classes 
with the maximum margin. The margin is just 
the distance from the hyperplane to the near- 
est training example. Before we continue, it is 
important to note that while the above example 
shows a 2D data set, which can be conveniently 
represented by points in a plane, in fact we will 
typically be dealing with higher dimensional 
data. For example, the example data in Table 
1 could easily be represented as points in four 
dimensions as follows: 

x_1 = [30 56000 16 0 1]; y_1 = +1 
x_2 = [50 60000 12 1 0]; y_2 = +1 
x_3 = [16 2000 11 0 1]; y_3 = -1 
x_4 = [35 30000 12 0 1]; y_4 = -1 

Actually, there are some design decisions to 
be made by the practitioner when translating 
attributes into the above type of numerical format, 
which we shall touch on in the next section. For 
example, here we have mapped the male/female 
column into two new numerical indicators. For 
now, just note that we have also listed the labels 
y_1 to y_4, which take on the value +1 or -1, in order 
to indicate the class membership of the examples 
(that is, y_i = 1 means that x_i has a broadband home 
Internet connection). 

In order to easily find the maximum margin 
hyperplane for a given data set using a computer, 
we would like to write the task as an optimization 
problem. Optimization problems consist of an 
objective function, which we typically want to 
find the maximum or minimum value of, along 
with a set of constraints, which are conditions 
that we must satisfy while finding the best value 
of the objective function. A simple example is 
to minimize x^2 subject to the constraint that 
1 ≤ x ≤ 2. The solution to this example optimization 
problem happens to be x = 1. To see how 
to compactly formulate the maximum margin 
hyperplane problem as an optimization problem, 
take a look at Figure 3. 

Figure 3. Linearly separable classification problem (the plot marks the 
decision surface <w,x> + b = 0 and the hyperplanes <w,x> + b = +1 and 
<w,x> + b = -1) 
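
The toy problem mentioned above (minimize x^2 with x restricted to the interval from 1 to 2, bounds included) can be solved numerically to see the roles of objective and constraint. The sketch below uses projected gradient descent, which is a technique chosen here purely for illustration; the step size and iteration count are arbitrary assumptions.

```python
def minimize_on_interval(grad, lower, upper, start, step=0.1, iterations=200):
    """Projected gradient descent for a one-dimensional problem.

    `grad(x)` is the derivative of the objective; after every step the
    iterate is clipped back into [lower, upper] to satisfy the constraint.
    """
    x = start
    for _ in range(iterations):
        x = x - step * grad(x)          # move downhill on the objective
        x = min(max(x, lower), upper)   # enforce the constraint
    return x
```

Calling `minimize_on_interval(lambda x: 2 * x, 1.0, 2.0, 2.0)` drives x down until it reaches the lower bound, where the constraint stops it; with a much wider interval the same routine would instead approach the unconstrained minimum at zero.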

The figure shows some 2D data drawn as 
circles and black dots, having labels +1 and -1 
respectively. As before, we have parameterized 
our decision function by the vector w and the scalar 
b, which means that, in order for our hyperplane 
to correctly separate the two classes, we need to 
satisfy the following constraints: 

$$\langle w, x_i \rangle + b > 0, \quad \text{for all } y_i = +1$$

$$\langle w, x_i \rangle + b < 0, \quad \text{for all } y_i = -1$$

To aid understanding, the first constraint above 
may be expressed as: “<w,x_i> + b must be greater 
than zero, whenever y_i is equal to one.” It is easy 
to check that the two sets of constraints above 
can be combined into the following single set of 
constraints: 

$$y_i \left( \langle w, x_i \rangle + b \right) > 0, \quad i = 1, \ldots, n$$

However, meeting this constraint is not enough 
to separate the two classes optimally: we need 
to do so with the maximum margin. An easy 
way to see how to do this is the following. First 
note that we have plotted the decision surface as 
a solid line in Figure 3, which is the set satisfying 
<w,x> + b = 0. 

The set of constraints that we have so far is 
equivalent to saying that these data must lie on 
the correct side (according to class label) of this 
decision surface. Next notice that we have also 
plotted as dotted lines two other hyperplanes, 
which are the hyperplanes where the function 
<w,x> + b is equal to -1 (on the lower left) and 
+1 (on the upper right). Now, in order to find 
the maximum margin hyperplane, we can see 
intuitively that we should keep the dotted lines 
parallel and equidistant to the decision surface, 
and maximize their distance from one another, 
while satisfying the constraint that the data lie 
on the correct side of the dotted lines associated 
with that class. In mathematical form, the final 
clause of this sentence (the constraints) can be 
written as: 

$$y_i \left( \langle w, x_i \rangle + b \right) \geq 1, \quad i = 1, \ldots, n.$$

All we need to do then is to maximize the 
distance between the dotted lines subject to the 
constraint set above. To aid in understanding, one 
commonly used analogy is to think of these data 
points as nails partially driven into a board. Now 
we successively place thicker and thicker pieces 
of timber between the nails representing the two 
classes until the timber just fits; the centreline 
of the timber now represents the optimal decision 
boundary. It turns out that this distance is equal to 
2/√<w,w>, and since maximizing 2/√<w,w> 
is the same as minimizing <w,w>, we end up with 
the following optimization problem, the solution 
of which yields the parameters of the maximum 
margin hyperplane. The term ½ in the objective 
function below can be ignored as it simply makes 
things neater from a certain mathematical point 
of view: 

$$
\begin{aligned}
\min_{w,b}\quad & \tfrac{1}{2}\, w \cdot w \\
\text{such that}\quad & y_i (w \cdot x_i + b) \geq 1 \quad \text{for all } i = 1, 2, \ldots, n
\end{aligned}
\tag{1}
$$
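
The pieces of this optimization problem are easy to evaluate for any candidate hyperplane. The sketch below does not solve the problem; it only checks the constraints, evaluates the objective, and computes the distance between the two dotted hyperplanes. The function names and the numerical tolerance are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def satisfies_hard_margin(w, b, xs, ys, tol=1e-12):
    """True if y_i (<w, x_i> + b) >= 1 holds for every training example."""
    return all(y * (dot(w, x) + b) >= 1 - tol for x, y in zip(xs, ys))


def objective(w):
    """The quantity being minimized: one half of <w, w>."""
    return 0.5 * dot(w, w)


def margin_width(w):
    """Distance between the hyperplanes <w,x> + b = +1 and <w,x> + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))
```

Among all pairs (w, b) for which `satisfies_hard_margin` is true, the one with the smallest `objective` has the largest `margin_width`, which is exactly the trade the optimization problem expresses.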

The previous problem is quite simple, but 
it encompasses the key philosophy behind the 
SVM — maximum margin data separation. If the 
above problem had been scribbled onto a cocktail 
napkin and handed to the pioneers of the Perceptron 
back in the 1960s, then the Machine Learning 
discipline would probably have progressed a great 
deal further than it has to date! We cannot relax 
just yet, however, as there is a major problem 
with the above method: What if these data are 
not linearly separable? That is, what if it is not possible 
to find a hyperplane that separates all of the 
examples in each class from all of the examples 
in the other class? In this case there would be no 
combination of w and b that could ever satisfy 
the set of constraints above, let alone do so with 
maximum margin. This situation is depicted in 
Figure 4, where it becomes apparent that we need 
to soften the constraint that these data lie on the 
correct side of the +1 and -1 hyperplanes, that is 
we need to allow some, but not too many data 
points to violate these constraints by a preferably 
small amount. This alternative approach turns 
out to be very useful not only for data sets that 
are not linearly separable, but also, and perhaps 
more importantly, in allowing improvements in 
generalization. 

Usually when we start talking about vague 
concepts such as “not too many” and “a small 
amount,” we need to introduce a parameter into 
our problem, which we can vary in order to bal- 
ance between various goals and objectives. The 
following optimization problem, known as the 
1-norm soft margin SVM, is probably the one 
most commonly used to balance the goals of 
maximum margin separation, and correctness of 
the training set classification. It achieves various 
trade-offs between these goals for various values 
of the parameter C, which is usually chosen by 
cross-validation on a training set as discussed 
earlier. 

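$$
\begin{aligned}
\min_{w,b,\xi}\quad & \tfrac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i \\
\text{such that}\quad & y_i (w \cdot x_i + b) + \xi_i \geq 1, \quad \xi_i \geq 0 \quad \text{for all } i = 1, 2, \ldots, n
\end{aligned}
\tag{2}
$$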


Figure 4. Linearly inseparable classification problem (the plot marks 
the decision surface <w,x> + b = 0 and the hyperplanes <w,x> + b = +1 
and <w,x> + b = -1) 

The easiest way to understand this problem is 
by comparison with the previous formulation that 
we gave, which is known as the hard margin SVM, 
in reference to the fact that the margin constraints 
are “hard,” and are not allowed to be violated at 
all. First note that we have an extra term in our 
objective function that is equal to the sum of 
the ξ_i's. Since we are minimizing the objective 
function, it is safe to say that we are looking for 
a solution that keeps the ξ_i values small. Moreover, 
since the ξ_i term is added to the original 
objective function after multiplication by C, we 
can say that as C increases we care less about the 
size of the margin, and more about keeping the 
ξ_i's small. The true meaning of the ξ_i's can only 
be seen from the constraint set, however. Here, 
instead of constraining the function y_i(<w,x_i> 
+ b) to be at least 1, we constrain it to be 
at least 1 - ξ_i. That is, we allow the point x_i 
to violate the margin by an amount ξ_i. Thus, the 
value of C trades between how large a margin 
we would prefer, as opposed to how many of the 
training set examples violate this margin (and 
by how much). 
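
The role of the slack variables and of C can be seen by evaluating the soft margin objective for a fixed hyperplane. In the sketch below, each slack is set to the smallest non-negative value that satisfies its constraint; the function names are illustrative assumptions, and nothing here solves the optimization problem itself.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(w, b, xs, ys):
    """Smallest slack >= 0 for each example such that y (<w,x> + b) + slack >= 1."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]


def soft_margin_objective(w, b, xs, ys, C):
    """One half of <w, w> plus C times the sum of the slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, xs, ys))
```

Examples that lie on the correct side of their dotted hyperplane receive a slack of zero; the others are penalized in proportion to their violation, and increasing C makes those violations weigh more heavily against the margin term.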

So far, we have seen that the maximally 
separating hyperplane is a good starting point for 
linear classifiers. We have also seen how to write 
down the problem of finding this hyperplane as an 
optimization problem consisting of an objective 
function and constraints. After this we saw a way 
of dealing with data that is not linearly separable, 
by allowing some training points to violate the 
margin somewhat. The next limitation we will 
address is in the form of solutions available. So 
far we have only considered very simple linear 
classifiers, and as such we can only expect to 
succeed in very simple cases. Fortunately it is 
possible to extend the previous analysis in an 
intuitive manner, to more complex classes of 
decision functions. The basic idea is illustrated 
in Figure 5. 

The example in Figure 5 shows on the left a 
data set that is not linearly separable. In fact, the 
data is not even close to linearly separable, and 
one could never do very well with a linear classi- 
fier for the training set given. In spite of this, it is 
easy for a person to look at the data and suggest 
a simple elliptical decision surface that ought to 
generalize well. Imagine, however, that there is a 
mapping Φ, which transforms these data to some 
new, possibly higher dimensional space, in which 
the data is linearly separable. If we knew Φ then 
we could map all of the data to the feature space, 
and perform normal SVM classification in this 
space. If we can achieve a reasonable margin in 
the feature space, then we can expect a reason- 
ably good generalization performance, in spite of 
a possible increase in dimensionality. 

Figure 5. An example of a mapping Φ to a feature space in which the data 
become linearly separable 

The last sentence of the previous paragraph 
is far deeper than it may first appear. For some 
time, Machine Learning researchers have feared 
the curse of dimensionality, a name given to the 
widely-held belief that if the dimension of the 
feature space is large in comparison to the number 
of training examples, then it is difficult to find a 
classifier that generalizes well. It took the theory 
of Vapnik and Chervonenkis (Vapnik, 1998) to 
put a serious dent in this belief. In a nutshell, 
they formalized and proved the last sentence of 
the previous paragraph, and thereby paved the 
way for methods that map data to very high di- 
mensional feature spaces where they then perform 
maximum margin linear separation. Actually, 
a tricky practical issue also had to be overcome 
before the approach could flourish: if we map to 
a feature space that is too high in dimension, then 
it will become impossible in practice to perform the 
required calculations (that is, to find w and b), because 
they would take too long on a computer. It is not obvious 
how to overcome this difficulty, and it took 
until 1995 for researchers to notice the following 
elegant and quite remarkable possibility. 

The usual way of proceeding is to take the 
original soft margin SVM, and convert it to an 
equivalent Lagrangian dual problem. The derivation 
is not especially enlightening, however, so we 
will skip to the result, which is that the solution 
to the following dual or equivalent problem gives 
us the solution to the original SVM problem. The 
dual problem, which is to be solved by varying 
the α_i's, is as follows (Vapnik, 1998): 

$$
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{m} \alpha_i \\
\text{such that}\quad & \sum_{i=1}^{m} y_i \alpha_i = 0, \\
& 0 \leq \alpha_i \leq C, \quad i = 1, 2, \ldots, m
\end{aligned}
\tag{3}
$$

Here m denotes the number of training examples. 
The α_i's are known as the dual variables, and 
they define the corresponding primal variables 
w and b by the following relationships: 

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i$$

$$\alpha_i \left( y_i (\langle w, x_i \rangle + b) - 1 \right) = 0$$

Note that by the linearity of the inner product 
(that is, the fact that <a+b,c> = <a,c> + <b,c>), we 
can write the decision function in the following 
form: 

$$f(x) = \langle w, x \rangle + b = \sum_{i=1}^{m} \alpha_i y_i \langle x_i, x \rangle + b$$

Recall that it is the sign of f(x) that gives us 
the predicted label of x. A quite remarkable thing 
is that in order to determine the optimal values 
of the α_i's and b, and also to calculate f(x), we do 
not actually need to know any of the training or 
testing vectors; we only need to know the scalar 
value of their inner product with one another. 
This can be seen by noting that the vectors only 
ever appear by way of their inner product with 
one another. The elegant thing is that rather than 
explicitly mapping all of the data to the new space 
and performing linear SVM classification, we can 
operate in the original space, provided we can find 
a so-called kernel function k(.,.) which is equal to 
the inner product of the mapped data. That is, we 
need a kernel function k(.,.) satisfying: 

$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$$

In practice, the practitioner need not be concerned 
with the exact nature of the mapping Φ. 
In fact, it is usually more intuitive to 
concentrate on properties of the kernel functions 
anyway, and the prevailing wisdom states that 
the function k(x,y) should be a good measure of 
the similarity of the vectors x and y. Moreover, 
not just any function k can be used: it must also 
satisfy certain technical conditions, known as 
Mercer’s conditions. This procedure of implicitly 
mapping the data via the function k is 
often called the kernel trick and has found wide 
application after being popularized by the success 
of the SVM (Scholkopf & Smola, 2002). 
The two most widely used kernel functions are 
the following. 

Polynomial Kernel 

$$k(x, y) = \left( \langle x, y \rangle + 1 \right)^d$$

The polynomial kernel is valid for all positive 
integers d ≥ 1. The kernel corresponds to a mapping 
Φ that computes all monomial terms up to degree d of 
the individual vector components of the original 
space. The polynomial kernel has been used to 
great effect on digit recognition problems. 
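
The kernel trick can be checked by hand for a small case. The sketch below compares the degree 2 polynomial kernel on two-dimensional inputs with the inner product of an explicit six-dimensional feature map. The particular map `phi_degree2`, including its √2 scaling factors, is one standard choice written out here as an illustration; it is not taken from the text.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def poly_kernel(x, y, d=2):
    """Polynomial kernel (<x, y> + 1) ** d, computed in the original space."""
    return (dot(x, y) + 1) ** d


def phi_degree2(x):
    """Explicit feature map for d = 2 and two-dimensional x (illustrative)."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r * x1 * x2, r * x1, r * x2, 1.0]
```

For any pair of 2D points, `poly_kernel(x, y)` and `dot(phi_degree2(x), phi_degree2(y))` agree up to rounding error, yet the kernel never constructs the six-dimensional vectors.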

Gaussian Kernel 

$$k(x, y) = \exp\left( -\frac{\| x - y \|^2}{\sigma^2} \right)$$

The Gaussian kernel, which is similar to the 
Gaussian probability distribution from which it 
gets its name, is one of a group of kernel functions 
known as radial basis functions (RBFs). 
RBFs are kernel functions that depend only on 
the geometric distance between x and y. The 
kernel is valid for all non-zero values of the kernel 
width σ, and corresponds to a mapping Φ into an 
infinite dimensional, and therefore somewhat less 
interpretable, feature space. Nonetheless, the 
Gaussian is probably the most useful and commonly 
used kernel function. 
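
A short sketch of the Gaussian kernel follows. It assumes the kernel has the form exp(-||x - y||^2 / σ^2); other texts and software packages scale the exponent differently, so the exact parameterization here should be treated as an assumption.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel, assuming the form exp(-||x - y||^2 / sigma^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / sigma ** 2)
```

The value is 1 when the two vectors coincide and decays towards 0 as they move apart, which matches the view of a kernel as a similarity measure; a larger width σ makes the decay slower.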

Now that we know the form of the SVM dual 
problem, as well as how to generalize it using 
kernel functions, the only thing left is to see 
how to actually solve the optimization problem, in 
order to find the α_i's. The optimization problem 
is one example of a class of problems known as 
Quadratic Programs (QPs). The term program, as 
it is used here, is somewhat antiquated and in fact 
means a “mathematical optimization problem,” 
not a computer program. Fortunately there are 
many computer programs that can solve QPs such 
as this; these computer programs are known as 
Quadratic Program (QP) solvers. An important 
factor to note here is that there is considerable 
structure in the QP that arises in SVM training, 
and while it would be possible to use almost any 
QP solver on the problem, there are a number of 
sophisticated software packages tailored to take 
advantage of this structure, in order to decrease the 
requirements of computer time and memory. 

One property of the SVM QP that can be taken 
advantage of is its sparsity: the fact that in many 
cases, at the optimal solution most of the α_i's will 
equal zero. It is interesting to see what this means 
in terms of the decision function f(x): those vectors 
with α_i = 0 do not actually enter into the final 
form of the solution. In fact, it can be shown that 
one can remove all of the corresponding training 
vectors before training even commences, and get 
the same final result. The vectors with non-zero 
values of α_i are known as the Support Vectors, 
a term that has its root in the theory of convex 
sets. As it turns out, the Support Vectors are the 
“hard” cases: the training examples that are 
most difficult to classify correctly (and that lie 
closest to the decision boundary). In our previous 
practical analogy, the support vectors are literally 
the nails that support the block of wood! Now 
that we have an understanding of the machinery 
underlying it, we will soon proceed to solve a 
practical problem using the freely available SVM 
software package libSVM written by Chang and Lin 
(Chang & Lin, 2001). 
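
The effect of sparsity on the decision function can be illustrated with a small sketch. The dual variables used in the usage note below are chosen by hand for a toy data set rather than produced by a QP solver, and the function names are hypothetical.

```python
def linear_kernel(u, v):
    """Plain inner product, that is, the kernel of the linear SVM."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decision_value(alphas, ys, xs, b, kernel, x):
    """f(x) = sum over i of alpha_i * y_i * k(x_i, x), plus b."""
    return sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, xs)) + b


def support_vectors_only(alphas, ys, xs):
    """Drop every training example whose dual variable is zero."""
    keep = [i for i, a in enumerate(alphas) if a != 0]
    return ([alphas[i] for i in keep],
            [ys[i] for i in keep],
            [xs[i] for i in keep])
```

With training points [1, 1], [3, 3], [-1, -1], [-3, -3], labels +1, +1, -1, -1, and dual variables 0.25, 0, 0.25, 0, the decision value at any query point is the same whether all four examples or only the two Support Vectors are used.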

Relationship to Other Methods 

We noted in the introduction that the SVM is 
an especially easy-to-use method that typically 
produces good results even when treated as a 
processing “black box.” This is indeed the case, 
and to better understand this it is necessary to 
consider what is involved in using some other 
methods. We will focus in detail on the extremely 
prevalent class of algorithms known as artificial 
neural networks, but first we provide a brief over- 
view of some other related methods. 

Linear discriminant analysis (Hand, 1981; Weiss 
& Kulikowski, 1991) is widely used in business 
and marketing applications, can work in multiple 
dimensions, and is well-grounded in the mathematical 
literature. It nonetheless has two major 
drawbacks. The first is that linear discriminant 
functions, as the name implies, can only successfully 
classify linearly separable data, thus limiting 
their application to relatively simple problems. If 
we extend the method to higher-order functions 
such as quadratic discriminators, generalization 
suffers. Indeed such degradation in performance 
with increased numbers of parameters corrobo- 
rated the belief in the “curse of dimensionality” 
finally disproved by Vapnik (1998). The second 
problem is simply that generalization performance 
on real problems is usually significantly worse 
than either decision trees or artificial neural 
networks (e.g., see the comparisons in Weiss & 
Kulikowski, 1991). 

Decision trees are commonly used in classification 
problems with categorical data (Quinlan, 1993), 
although it is possible to derive categorical data 
from ordinal data by introducing binary valued 
features such as “age is less than 20.” Decision 
trees construct a tree of questions to be asked of 
a given example in order to determine the class 
membership by way of class labels associated with 
leaf nodes of the decision tree. This approach 
is simple and has the advantage that it produces 
decision rules that can be interpreted by a human 
as well as a machine. However the SVM is more 
appropriate for complex problems with many 
ordinal features. 

Nearest neighbor methods are very simple and 
therefore suitable for extremely large data sets. 
These methods simply search the training data set 
for the k examples that are closest (by the criteria 
of Euclidean distance for example) to the given 
input. The most common class label associ- 
ated with these k examples is then assigned to the given 
query example. When the training and testing 
computation times are not so important, however, 
the discriminative nature of the SVM will usually 
yield significantly improved results. 
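
The nearest neighbor rule described above is short enough to sketch in full. This version uses Euclidean distance and a simple majority vote; the function name and the default k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` with the most common label among its k nearest neighbors."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```

Note that there is no training step at all: every prediction searches the stored training set, which is why the method is simple to apply but does no explicit discrimination between the classes.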


Artificial neural network (ANN) algorithms 
have become extremely widespread in the area 
of data mining and pattern recognition (Bishop, 
1995). These methods were originally inspired by 
the neural connections that comprise the human 
brain — the basic idea being that in the human 
brain many simple units (neurons) are connected 
together in a manner that produces complex, 
powerful behavior. To simulate this phenomenon, 
neurons are modeled by units whose output y is 
related to the input x by some activation function 
g by the relationship y = g(x). These units are 
then connected together in various architectures, 
whereby the output of a given unit is multiplied 
by some constant weight and then fed forward 
as input to the next unit, possibly in summation 
with a similarly scaled output from some other 
unit(s). Ultimately all of the inputs are fed to one 
single final unit, the output of which is typically 
compared to some threshold in order to produce 
a class membership prediction. This is a very 
general framework that provides many avenues 
for customization: 

• Choice of activation function. 

• Choice of network architecture (number 
of units and the manner in which they are 
connected). 

• Choice of the “weights” by which the output 
of a given unit is multiplied to produce the 
input of another unit. 

• Algorithm for determining the weights given 
the training data. 
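
The feed-forward computation described above can be sketched for a very small network with one layer of intermediate units and a single final unit. The choice of tanh as the activation function, the zero threshold, and the function names are all assumptions made for illustration; the weights would normally be found by a training algorithm that is not shown.

```python
import math


def unit_output(inputs, weights, activation=math.tanh):
    """Output of one unit: the activation applied to the weighted sum of its inputs."""
    return activation(sum(w * x for w, x in zip(weights, inputs)))


def predict(x, hidden_weights, output_weights, threshold=0.0):
    """Feed x through one hidden layer and a final unit, then threshold the result."""
    hidden = [unit_output(x, w) for w in hidden_weights]
    score = unit_output(hidden, output_weights)
    return 1 if score > threshold else -1
```

Each of the customization choices listed above appears as a separate argument or design decision in this sketch, which hints at how much there is to tune in practice.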

In comparison to the SVM, both the strength 
and the weakness of the ANN lie in its flexibility: 
typically a considerable amount of experimentation 
is required in order to achieve good results, 
and moreover, since the optimization problems 
that are typically used to find the weights of the 
chosen network are non-convex, many numerical 
tricks are required in order to find a good solution 
to the problem. Nonetheless, given sufficient skill 
and effort in engineering a solution with an ANN, 
one can often tailor the algorithm very specifically 
to a given problem in a process that is likely 
to eventually yield superior results to the SVM. 
Having said this, there are cases, for example in 
handwritten digit recognition, in which SVM 
performance is on par with highly engineered 
ANN solutions (DeCoste & Scholkopf, 2002). By 
way of comparison, the SVM approach is likely 
to yield a very good solution with far less effort 
than is required for a good ANN solution. 

Practical Application of the SVM 

As we have seen, the theoretical underpinnings of 
the SVM are very compelling, especially since the 
algorithm involves very little trial and error, and 
is easy to apply. Nonetheless, the usefulness of 
the algorithm can only be borne out by practical 
experience, and so in this sub-section we survey 
a number of studies that use the SVM algorithm 
in practical problems. Before we mention such 
specific cases, we first identify the general char- 
acteristics of those problems to which the SVM 
is particularly well-suited. One key consideration 
is that in its basic form the SVM has limited 
capacity to deal with large training data sets. 
Typically the SVM can only handle problems of 
up to approximately 100,000 training examples 
before approximations must be made in order to 
yield reasonable training times. Having said this, 
the training times depend only marginally on the 
dimensionality of the features. It is often said 
that the SVM can defy the so-called curse of 
dimensionality, that is, the difficulty that often occurs 
when the dimensionality is high in comparison 
with the number of training samples. It should 
also be noted that, with the exception of the string 
kernel case, the SVM is most naturally suited 
to ordinal features rather than categorical ones, 
although as we shall see in the next section, it is 
possible to handle both cases. 

Before turning to some specific business 
and marketing cases, it is important to note that 
some of the most successful applications of the 
SVM have been in image processing, in particular 
handwritten digit recognition (DeCoste 
& Scholkopf, 2002) and face recognition (Osuna, 
Freund & Girosi, 1997). In these areas, a com- 
mon theme of the application of SVM is not so 
much increased accuracy, but rather a greatly 
simplified design and implementation process. 
As such, when considering popular areas such 
as face recognition, it is important to understand 
that very simple SVM implementations are often 
competitive with the complex and highly tuned 
systems that were developed over a longer pe- 
riod prior to the advent of the SVM. Another 
interesting application area for SVM is on string 
data, for example in text mining or the analysis 
of genome sequences (Joachims, 2002). The key 
reason for the great success of SVM in this area 
is the existence of “string kernels” — these are 
kernel functions defined on strings that elegantly 
avoid many of the combinatoric problems as- 
sociated with other methods, whilst having the 
advantage over generative probability models 
such as the Hidden Markov Model that the SVM 
learns to discriminate between the two classes via 
the maximization of the margin. The practical 
use of text categorization systems is extremely 
widespread, with most large enterprises relying 
on such analysis of their customer interactions 
in order to provide automated response systems 
that are nonetheless tailored to the individual. 
Furthermore, the SVM has been successfully 
used in a study of text and data mining for direct 
marketing applications (Cheung, Kwok, Law, & 
Tsui, 2003) in which relatively limited customer 
information was automatically supplemented with 
the preferences of a larger population, in order to 
determine effective marketing strategies. SVMs 
have enjoyed success in a number of other busi- 
ness related applications, including credit rating 
analysis (Huang, Chen, Hsu, Chen, & Wu, 2004) 
and electricity price forecasting (Sansom, Downs, 
& Saha, 2002). To conclude this survey, note that 
while the majority of marketing teams do not 
publish their methodologies, many of the 
important data mining software packages (e.g., 
Oracle Data Mining and SAS Enterprise Miner) 
have incorporated the SVM, so it is likely that there 
will be a significant and increasing use of the 
SVM in industrial settings. 

A WORKED EXAMPLE 

In “A Practical Guide to Support Vector Clas- 
sification” (Hsu, Chang, & Lin, 2003) a simple 
procedure for applying the SVM classifier is 
provided for inexperienced practitioners of the 
SVM classifier. The procedure is intended to be 
easy to follow, quick, and capable of producing 
reasonable generalization performance. The steps 
they advocate can be paraphrased as follows: 

1. Convert the data to the input format of the 
SVM software you intend to use. 

2. Scale the individual components of the data 
into a common range. 

3. Use the Gaussian kernel function. 

4. Use cross-validation to find the best param- 
eters C (margin softness) and σ (Gaussian 
width). 

5. With the values of C and σ determined by 
cross-validation, retrain on the entire train- 
ing set. 
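
Step 2 of the procedure above deserves a brief illustration, since features such as age and income can differ in magnitude by several orders. The sketch below rescales each component linearly into a common range; the target range of -1 to +1, the handling of constant features, and the function names are assumptions made here rather than a description of any particular software package.

```python
def fit_ranges(rows):
    """Record the minimum and maximum of every feature in the training data."""
    columns = list(zip(*rows))
    return [(min(column), max(column)) for column in columns]


def scale(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature from its training range into [lower, upper]."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                # A constant feature carries no information; use the midpoint.
                new_row.append((lower + upper) / 2.0)
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled
```

The ranges should be computed once from the training set and then reused unchanged for any new data, so that training and test examples are transformed in exactly the same way; new values may therefore fall slightly outside the target range.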

The above tasks are easily accomplished using, 
for example, the free libSVM software package, 
as we will demonstrate in detail in this section. 
We have chosen this tool because it is free, easy 
to use and of a high quality, although the major- 
ity of our discussion applies equally well to other 
SVM software packages wherein the same steps 
will necessarily be required. The point of this 
chapter, then, is to illustrate in a concrete fashion 
the process of applying an SVM. The libSVM 
software package with which we do this consists 
of three main command-line tools, as well as a 
helper script in the Python language. The basic 
functions of these tools are summarized here: 


• svm-scale: This simple program rescales the data as in step 2 above. The input is a data set, and the output is a new data set that has been rescaled.

• grid.py: This script can be used to assist in the cross-validation parameter selection process. It calculates a cross-validation estimate of generalization performance for a range of values of C and the Gaussian kernel width σ. The results are then illustrated as a two-dimensional contour plot of generalization performance versus C and σ.

• svm-train: This is the most sophisticated part of libSVM. It takes as input a file containing the training examples, and outputs a “model file”: a list of support vectors and the corresponding α_i's, as well as the bias term and kernel parameters. The program also takes a number of input arguments that are used to specify the type of kernel function and the margin softness parameter. As well as some more technical options, the program also has the option (used by grid.py) of computing an n-fold cross-validation estimate of the generalization performance.

• svm-predict: Having run svm-train, svm-predict can be used to predict the class labels of a new set of unseen data. The input to the program is a model file and a data set, and the output is a file containing the predicted labels, sign(f(x)), for the given data set.

Detailed instructions for installing the software can be found on the libSVM Web site (Chang & Lin, 2001). We will now demonstrate these steps using the example data set at the beginning of the chapter, in order to predict which customers are likely to be home broadband Internet users. To make the procedure clear, we will give details of all the required input files (containing the labelled and unlabelled data), the output file (containing the learned decision function), and the command line statements required to produce and process these files.


Preprocessing (svm-scale) 

All of our discussions so far have considered the 
input training examples as numerical vectors. 
In fact this is not necessary as it is possible to 
define kernels on discrete quantities, but we will 
not worry about that here. Instead, notice that 
in our example training data in Table 1, each 
training example has several individual features, 
both numerical and categorical. There are three 
numerical features (age, income and years of 
education), and one categorical feature (gender). 
In constructing training vectors for the SVM from these training examples, the numerical features are directly assigned to individual components of the training vectors.

Categorical features, however, must be dealt with slightly differently. Typically, if the categorical feature belongs to one of m different categories (here the categories are male and female, so that our m is 2), then we map this single categorical feature into m individual binary-valued numerical features. A training vector whose categorical feature corresponds to category n (the ordering is irrelevant) will have all zero values for these m binary-valued numerical features, except for the n-th one, which we set to 1. This is a simple way of indicating that the categories are not related to one another by relative magnitudes. The data in Table 1 would then be represented by these four vectors, with corresponding class labels y_i:

x_1 = [30 56000 16 0 1]; y_1 = +1
x_2 = [50 60000 12 1 0]; y_2 = +1
x_3 = [16 2000 11 0 1]; y_3 = -1
x_4 = [35 30000 12 0 1]; y_4 = -1

In order to use the libSVM software, we must represent the above data in a file that is formatted according to the libSVM standard. The format is very simple, and best described with an example. The above data would be represented by a single file that looks like this:


+1 1:30 2:56000 3:16 5:1 
+1 1:50 2:60000 3:12 4:1 
-1 1:16 2:2000 3:11 5:1 
-1 1:35 2:30000 3:12 5:1 


Each line of the training file represents one 
training example, and begins with the class label 
(+1 or -1), followed by a space and then an arbi- 
trary number of index:value pairs. There should 
be no spaces between the colons and the indexes 
or values, only between the individual index: 
value pairs. Note that if a feature takes on the 
value zero, it need not be included as an index: 
value pair, allowing data with many zeros to be 
represented by a smaller file. 
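
The encoding and file format described above can be sketched in a few lines of Python. This is a minimal illustration and not part of libSVM; the function name `to_libsvm_line` and the record layout (three numerical values followed by a 1-based category index) are assumptions made for this example.

```python
def to_libsvm_line(label, numeric, category, num_categories):
    """Format one example as a libSVM line, one-hot encoding the category.

    `category` is the 1-based index n of the categorical value; zero-valued
    features are omitted, as the format allows.
    """
    one_hot = [1 if k == category else 0 for k in range(1, num_categories + 1)]
    features = list(numeric) + one_hot
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
    sign = "+1" if label > 0 else "-1"
    return " ".join([sign] + pairs)


# The four training examples from the text:
# (label, [age, income, years of education], category index).
examples = [
    (+1, [30, 56000, 16], 2),
    (+1, [50, 60000, 12], 1),
    (-1, [16, 2000, 11], 2),
    (-1, [35, 30000, 12], 2),
]
lines = [to_libsvm_line(y, x, c, num_categories=2) for y, x, c in examples]
print("\n".join(lines))
```

The printed lines reproduce the four-line training file shown above, with the zero-valued binary feature left out of each line.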

Now that we have our training data file, we are ready to run svm-scale. As we discovered in the first section, ultimately all our data will be represented by the kernel function evaluation between individual vectors. The purpose of this program is to make some very simple adjustments to the data in order for it to be better represented by these kernel evaluations. In accordance with step 3 above we will be using the Gaussian kernel, which can be expressed by:


$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma}\right) = \exp\left(-\sum_{d=1}^{D} \frac{(x_d - y_d)^2}{\sigma}\right)$$


Here we have written out the D individual components of the vectors x and y, which correspond to the (D = 5) individual numerical features of our training examples. It is clear from the summation on the right that if a given feature has a much larger range of variation than another feature, it will dominate the sum, and the feature with the smaller range of variation will essentially be ignored. For our example, this means that the income feature, which has the largest range of values, will receive an undue amount of attention from the SVM algorithm. Clearly this is a problem, and while the machine learning community has yet to give the final word on how to deal with it in an optimal manner, many practitioners simply rescale the data so that each feature falls in the same range, for example between zero and one. This can be easily achieved using svm-scale, which takes as input a data file in libSVM format, and outputs both a rescaled data file and a set of scaling parameters. The rescaled data should then be used to train the model, and the same scaling (as stored in the scaling parameters file) should be applied to any unlabelled data before applying the learnt decision function. The format of the command is as follows:

svm-scale -s scaling_parameters_file training_data_file > rescaled_training_data_file

In order to apply the same scaling transformation to the unlabelled set, svm-scale must be executed again with the following arguments:

svm-scale -r scaling_parameters_file unlabelled_data_file > rescaled_unlabelled_data_file

Here the file unlabelled_data_file contains the unlabelled data, and has an identical format to the training file, aside from the fact that the labels +1 and -1 are optional, and will be ignored if they exist.
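
To make the effect of rescaling concrete, the following Python sketch rescales each feature to the range [0, 1] using the minimum and maximum seen in the training data, then reuses those stored parameters for a new example. It only illustrates the idea; svm-scale itself is a separate program with its own options and defaults, and the function names `fit_scaling` and `apply_scaling` and the unlabelled example are hypothetical.

```python
def fit_scaling(rows):
    """Return per-feature (min, max) pairs computed from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_scaling(row, params):
    """Map each feature to [0, 1] with stored parameters (constant features map to 0)."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, params)]


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


train = [
    [30, 56000, 16, 0, 1],
    [50, 60000, 12, 1, 0],
    [16, 2000, 11, 0, 1],
    [35, 30000, 12, 0, 1],
]
params = fit_scaling(train)
scaled = [apply_scaling(row, params) for row in train]

# Before scaling, the income difference accounts for almost all of ||x_1 - x_4||^2.
raw = squared_distance(train[0], train[3])
income_share = (train[0][1] - train[3][1]) ** 2 / raw
print(f"income share of raw squared distance: {income_share:.6f}")
print(f"scaled squared distance: {squared_distance(scaled[0], scaled[3]):.4f}")

# A new (hypothetical) unlabelled example is scaled with the SAME parameters.
print(apply_scaling([40, 45000, 14, 1, 0], params))
```

Storing `params` and reapplying it plays the same role as the scaling parameters file passed with -s and -r in the commands above.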

Parameter Selection (grid.py) 

The parameter selection process is without doubt the most difficult step in applying an SVM. Fortunately, the simple method we prescribe here is not only relatively straightforward, but also usually quite effective. Our goal is to choose the C and σ values for our SVM. Following the previous discussion about parameter or model selection, our basic method of tackling this problem is to make a cross-validation estimate of the generalization performance for a range of values of C and σ, and examine the results visually. Given the outcome of this step, we may either choose values for C and σ, or conduct a further search based on the results we have already seen.

The following command will construct a 
plot of the cross-validation performance for our 
scaled data set: 


grid.py -log2c -5,5,1 -log2g -20,0,1 -v 10 rescaled_training_data_file


The search ranges of the C and σ values are specified by the -log2c and -log2g options, respectively. In both cases the numbers that follow take the form begin,end,stepsize to indicate that we wish to search logarithmically using the values 2^begin, 2^(begin+stepsize), ..., 2^end.

Specifying “-v n” indicates that we wish to do n-fold cross-validation (in the above command n = 10), and the last argument to the command indicates which data file to use. The output of the program is a contour plot, saved in an image file of the name rescaled_training_data_file.png. The output image for the above command is depicted in Figure 6.
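
As a small illustration of how the begin,end,stepsize triples translate into candidate values, the following Python sketch enumerates the grid searched by the command above. The helper name `log2_grid` is hypothetical and is not part of grid.py; it assumes a positive step size.

```python
def log2_grid(begin, end, step):
    """Return the values 2^begin, 2^(begin+step), ..., 2^end (step > 0)."""
    count = int((end - begin) // step) + 1
    return [2.0 ** (begin + k * step) for k in range(count)]


c_values = log2_grid(-5, 5, 1)       # candidates for C
g_values = log2_grid(-20, 0, 1)      # candidates for the kernel parameter
print(len(c_values), len(g_values))  # number of candidates along each axis
print(len(c_values) * len(g_values), "parameter pairs, each scored by cross-validation")
```

Every pair on this grid requires its own cross-validation run, which is why the search range is usually kept coarse at first and refined only where the contour plot looks promising.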

The contour plot indicates with various line colors the cross-validation accuracy of the classifier as a function of C and σ. This is measured as a percentage of correct classifications, so that we prefer large values. Note that σ is in fact referred to as “gamma” by the libSVM software; the variable name is of course arbitrary, but we choose to refer to it as σ for compatibility with the majority of SVM literature.

Given such a contour plot of performance, as stated previously there are generally two conclusions to be reached:

1. The optimal (or at least satisfactory) values of C and σ are contained within the plotting region.

2. It is necessary to continue the search for C and σ, over a different range than that of the plot, in order to achieve better performance.

Figure 6. A contour plot of cross-validation accuracy for a given training set as produced by grid.py

In the first case, we can read the optimal values of C and σ from the output of the program on the command window. Each line of output indicates the best parameters that have been encountered up to that point, and so we can take the last line as our operating parameters.

In the second case, we must choose in which direction to continue the search. From Figure 6 it seems feasible to keep searching over a range of smaller σ and larger C. This whole procedure is usually quite effective; however, there can be no denying that the search for the correct parameters is still something of a black art. Given this, we invite interested readers to experiment for themselves, in order to get a basic feel for how things behave. For our purposes, we shall assume that a good choice is C = 2^-2 = 0.25 and σ = 2^-2 = 0.25, and proceed to the next step.

Training (svm-train) 

As we have seen, the cross-validation process does not use all of the data for training: at each iteration some of the training data must be excluded for evaluation purposes. For this reason it is still necessary to do a final training run on the entire training set, using the parameters that we have determined in the previous parameter selection process. The command to train is:

svm-train -g 0.25 -c 0.25 rescaled_training_data_file model_file

This command sets C and σ using the -c and -g switches, respectively. The other two arguments are the name of the training data file, and finally the file name for the learnt decision function or model.

Prediction (svm-predict) 

The final step is very simple. Now that we have a decision function, stored in the file model_file, as well as a properly scaled set of unlabelled data, we can compute the predicted label of each of the examples in the set of unlabelled data by executing the command:

svm-predict rescaled_unlabelled_data_file model_file predictions_file

After executing this command, we will have a new file of the name predictions_file. Each line of this file will contain either “+1” or “-1” depending on the predicted label of the corresponding entry in the file rescaled_unlabelled_data_file.
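
Conceptually, the prediction step evaluates the sign of the learned decision function for each example. The Python sketch below illustrates this under stated assumptions: the decision function is taken to have the usual kernel expansion f(x) = Σ_i α_i y_i k(x_i, x) + b, the Gaussian kernel is assumed to take the form exp(-‖x - y‖²/σ), and the support vectors, coefficients, and bias are made-up values rather than the contents of a real libSVM model file.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel, assumed here to have the form exp(-||x - y||^2 / sigma)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma)


def predict(x, support_vectors, coeffs, bias, sigma):
    """Return sign(f(x)) with f(x) = sum_i coeffs[i] * k(sv_i, x) + bias.

    Each entry of `coeffs` stands for the product alpha_i * y_i.
    """
    f = bias + sum(c * gaussian_kernel(sv, x, sigma)
                   for sv, c in zip(support_vectors, coeffs))
    return +1 if f >= 0 else -1


# Hypothetical model: two support vectors in an already rescaled 2-D space.
support_vectors = [[0.9, 0.8], [0.1, 0.2]]
coeffs = [1.0, -1.0]   # alpha_i * y_i for each support vector (illustrative values)
bias = 0.0
sigma = 0.25

for point in ([0.8, 0.9], [0.2, 0.1]):
    print(point, predict(point, support_vectors, coeffs, bias, sigma))
```

Each new point is assigned the label of the support vector it lies closest to, because the nearer support vector contributes the larger kernel value.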


CONCLUSION 

The general problem of induction is an important 
one, and can add a great deal of value to large 
corporate databases. Analyzing this data is not 
always simple, however, and it is fortunate that 
methods that are both easy to apply and effective 
have finally arisen, such as the support vector 
machine. 

The basic concept underlying the support vector machine is quite simple and intuitive, and involves separating our two classes of data from one another using a linear function that is the maximum possible distance from the data. This basic idea becomes a powerful learning algorithm when one overcomes the issue of linear separability (by allowing margin errors) and adds an implicit mapping to more descriptive feature spaces (through the use of kernel functions).

Moreover, there exist free and easy to use 
software packages, such as libSVM, that allow 
one to obtain good results with a minimum of 
effort. The continued uptake of these tools is 
inevitable, but is often impeded by the poor results 
obtained by novices. We hope that this chapter is 
a useful aid in avoiding this problem, as it quickly 
affords a basic understanding of both the theory 
and practice of the SVM.


INTRODUCTION

Support Vector Machines (SVMs) (Boser et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1995) are a relatively new statistical supervised learning method, first introduced by Vapnik and his co-workers for binary classification problems. They have since been extended to multi-class problems, regression tasks, and novelty detection. These statistical learning algorithms are rapidly gaining popularity due to a large number of attractive performance results in areas including bioinformatics (Guyon et al., 2002), text mining (Paaß et al., 2002), fraud detection (Hyun-Chul et al., 2002), speaker identification (Wan & Renals, 2002), and database marketing (Bennett et al., 1999), among many others.

SVMs adopt the structural risk minimization (SRM) principle, as opposed to the empirical risk minimization (ERM) approach most commonly employed within statistical, neural, and rule-based learning methods. This SRM principle has made SVMs an excellent tool for improved generalization. A kernel implicitly maps the data points from the input space to a higher dimensional feature space, in which only dot products need to be computed. The feature space could theoretically be of infinite dimension, where linear discrimination is possible by constructing the optimal hyperplane. This is another significant distinguishing feature of SVMs compared to other traditional learning algorithms.


Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited. 


Kernel Width Selection for SVM Classification: A Meta-Learning Approach 


The polynomial and radial basis function (RBF) kernels are the most popular classical SVM kernels. According to Ou et al., the RBF kernel is more suitable than other SVM kernels (Ou et al., 2003). Hsu et al. suggested that in general RBF is a reasonable first choice for SVM classification. A good number of kernels have been proposed by researchers to date, but no single kernel performs best for all problems. The performance of the SVM method depends on the selection of a suitable kernel. The most common procedure for selecting the best SVM kernel is the trial-and-error approach. Joachims argues that SVMs are universal learners that can learn a problem with a simple “plug-in” of an appropriate kernel function (Joachims, 1998). This is a very lengthy procedure due to the vast range of kernel functions available. Onoda et al. argued that selection of a suitable kernel for SVM is an important research issue for real-world applications (Onoda et al., 2002). A priori kernel selection for SVM is nevertheless a difficult task for the user (Amari & Wu, 1999; Parrado-Hernandez et al., 2003). Clearly, given the number of kernels available, automatic kernel selection is a key issue for SVM, as an alternative to the current trial-and-error nature of selecting the best kernel for a given problem. We found in the SVM literature (Joachims, 1998; Morik et al., 1999) that manually supplying the kernel parameter is the traditional approach for SVM users. The RBF parameter (width) could be selected by an optimization approach (Chapelle et al., 2002). Muller et al. (2001) suggest that the RBF width could be selected by a cross-validation procedure. This is the most common method of RBF width selection. Carlos et al. argue that both optimization and cross-validation methods are computationally very expensive, and they suggest selecting the RBF width within a range for regression using meta-learning (Carlos et al., 2004). Scholkopf (2003) suggested searching the RBF width between 0.2 and 1. This is still the most popular way to set the RBF kernel parameter for SVM. Therefore, how to choose automatically the most suitable RBF kernel function and its optimum width for SVM classification remains a research issue.

Our methodology seeks to understand the characteristics of the data (classification problem), understand on which types of problems the RBF kernel performs well, and generate rules to assist in the automatic selection of the RBF kernel for SVMs. First we classify a wide range of classification problems with different kernels and then identify the dataset characteristics matrix by statistical measures, as we have done in some previous related work (Smith et al., 2002; Smith et al., 2001). We then build models for 112 classification problems (see Appendix A) from the UCI Repository (Blake & Merz, 2002) and the Knowledge Discovery Central (Lim, 2002) database using SVM with six different kernels. Finally, we use the induction algorithm C5.0 (Windows version See5, http://www.rulequest.com/see5-info.html) to generate rules that describe for which type of problem the RBF kernel is suitable, given the dataset characteristics and the performance of the RBF kernel on each dataset. We also examine the rules by their 10-fold cross-validation (10FCV) performance. We then estimate the RBF width by the maximum likelihood (ML) method and the Nelder-Mead (N-M) simplex method. Based on the performance of both RBF width estimation methods, we repeat the rule generation procedure and select the best rule for the width estimation methods by 10-fold cross-validation performance.

Our chapter is organized as follows: First, we provide a brief theoretical review of SVM and the rules for RBF kernel selection with their evaluation. Then we explain the formulation for best RBF width selection and its performance evaluation with statistical significance test results. All statistical measures used to identify the dataset characteristics matrix are summarized next. After that, a brief review of the rule-based learning algorithm C5.0, the post-processing of the experimental results, and an introduction to the rules for the best width selection methods with their evaluation are presented. Finally, we conclude our research.


SUPPORT VECTOR MACHINE



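Let us consider a dataset D of l independently and identically distributed (i.i.d.) samples: $(x_1, y_1), \ldots, (x_l, y_l)$. Each sample is a feature vector of length m, $x_i = (x_{i1}, \ldots, x_{im})$, and the target value $y_i \in \{1, \ldots, k\}$ represents the multi-class membership. Now, the machine-learning task is to learn the classes for each pattern by finding a classifier with decision function $f(x_i, \alpha_i)$, where $f(x_i, \alpha_i) = y_i$, $\alpha_i \in \Lambda$, $\forall (x_i, y_i) \in D$, and $\Lambda$ is a set of abstract parameters. We consider SVM to learn this problem. SVM estimates the learning parameters by solving the following quadratic optimization problem (Weston & Watkins, 1999):

$$\min_{\omega, \xi} \phi(\omega, \xi) = \frac{1}{2} \sum_{m=1}^{k} (\omega_m \cdot \omega_m) + C \sum_{i=1}^{l} \sum_{m \neq y_i} \xi_i^m \qquad (1)$$

subject to:

$$(\omega_{y_i} \cdot x_i) + b_{y_i} \geq (\omega_m \cdot x_i) + b_m + 2 - \xi_i^m,$$

$$\xi_i^m \geq 0, \quad i = 1, \ldots, l, \quad m \in \{1, \ldots, k\} \setminus y_i,$$

where ω = weight, ξ = slack variable, C = upper limit, and b = bias.

Now, we can solve this optimization problem by finding the saddle point of the Lagrangian:

$$L(\omega, b, \xi, \alpha, \beta) = \frac{1}{2} \sum_{m=1}^{k} (\omega_m \cdot \omega_m) + C \sum_{i=1}^{l} \sum_{m=1}^{k} \xi_i^m - \sum_{i=1}^{l} \sum_{m=1}^{k} \alpha_i^m \left[ ((\omega_{y_i} - \omega_m) \cdot x_i) + b_{y_i} - b_m - 2 + \xi_i^m \right] - \sum_{i=1}^{l} \sum_{m=1}^{k} \beta_i^m \xi_i^m \qquad (2)$$

with the dummy variables:

$$\alpha_i^{y_i} = 0, \quad \beta_i^{y_i} = 0, \quad \xi_i^{y_i} = 0$$

subject to:

$$\alpha_i^m \geq 0, \quad \beta_i^m \geq 0, \quad \xi_i^m \geq 0, \quad i = 1, \ldots, l \quad \text{and} \quad m \in \{1, \ldots, k\} \setminus y_i,$$

which is maximized with respect to α and β and minimized with respect to ω and ξ by considering the notation:

$$c_i^n = \begin{cases} 1 & \text{if } y_i = n \\ 0 & \text{if } y_i \neq n \end{cases}$$

and

$$A_i = \sum_{m=1}^{k} \alpha_i^m \qquad (3)$$

After differentiation, the optimal α is obtained by maximizing:

$$W(\alpha) = 2 \sum_{i,m} \alpha_i^m + \sum_{i,j,m} \left[ -\frac{1}{2} c_j^{y_i} A_i A_j + \alpha_i^m \alpha_j^{y_i} - \frac{1}{2} \alpha_i^m \alpha_j^m \right] (x_i \cdot x_j) \qquad (4)$$

Finally the decision function for multi-class SVM is:

$$f(x) = \arg\max_{n} \left[ \sum_{i: y_i = n} A_i (x_i \cdot x) - \sum_{i: y_i \neq n} \alpha_i^n (x_i \cdot x) + b_n \right] \qquad (5)$$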

The inner product $(x_i \cdot x)$ can be replaced by the convolution inner product $K(x_i, x_j)$, also known as the kernel function. Some commonly used SVM kernels with their mathematical expressions are listed in Table 1.

A graphical comparison of the effect of different parameter values on the RBF kernel is shown in Figure 1 for a binary class synthetic problem. The rectangular and cross signs indicate the two different classes of the problem.

The RBF kernel with width 0.8 and 1 classifies all patterns perfectly with a single optimal hyperplane. But the other parameters for the RBF kernel construct several optimal boundaries to classify all the patterns. It is interesting to observe from Figure 1 how each RBF kernel width generates the optimal hyperplane and how certain kernel parameters are limited in their ability to find the optimal hyperplane for highly non-separable data.

Table 1. Commonly used SVM kernel functions.

Kernel name                                     Kernel function
linear kernel (Vapnik, 1995)                    $K(x_i, x_j) = x_i^T x_j$
polynomial kernel (Vapnik, 1995)                $K(x_i, x_j) = \left((x_i^T x_j) + 1\right)^d$
rbf kernel (Vapnik, 1995)                       $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2h^2}\right)$, where $h > 0$
multiquadratic kernel (Evgeniou et al., 2000)   $K(x_i, x_j) = \left(\|x_i - x_j\|^2 + \tau^2\right)^{1/2}$, where $\tau > 0$
spline kernel (Gunn, 1998)                      $K(x_i, x_j) = 1 + (x_i^T x_j) + \frac{1}{2}(x_i^T x_j)\min(x_i, x_j) - \frac{1}{6}\min(x_i, x_j)^3$
sigmoidal kernel (Evgeniou et al., 2000)        $K(x_i, x_j) = \tanh\left(\kappa (x_i^T x_j) + \theta\right)$, where $\kappa$ and $\theta$ are parameters
Laplace kernel (Ali and Smith, 2004a)           $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{h}\right)$, where h is the kernel smoothing parameter

In our previous study, we found the best rule for the RBF kernel; its evaluation performance based on 10FCV is presented in Table 2. For other kernel rules, see Ali and Smith (2004b); the summarized performance of the kernels is given in Appendix B.

The best rules for the RBF kernel are generated with c = 70% and m = 2 as follows (see Figure 1):

Rule #1. IF (range > 9 AND normal cdf > 7.2957) OR (discrete uniform cdf <= 2.8185), THEN we should choose the RBF kernel for SVM classification.
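
Rule #1 can be written as a small predicate. The Python sketch below is only an illustration: the function and argument names are hypothetical, and the three arguments are assumed to be the dataset-level statistical measures (range, normal cdf, and discrete uniform cdf) taken from the dataset characteristics matrix. The example values are made up.

```python
def choose_rbf_kernel(data_range, normal_cdf, discrete_uniform_cdf):
    """Return True when Rule #1 recommends the RBF kernel for a dataset."""
    return ((data_range > 9 and normal_cdf > 7.2957)
            or discrete_uniform_cdf <= 2.8185)


# Hypothetical characteristic values for three datasets.
print(choose_rbf_kernel(12.0, 8.1, 5.0))   # first clause satisfied
print(choose_rbf_kernel(3.0, 8.1, 2.5))    # second clause satisfied
print(choose_rbf_kernel(3.0, 8.1, 5.0))    # neither clause satisfied
```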

In the following sections, we attempt to select the RBF kernel parameter based on data set properties. The rule for this kernel is highly acceptable due to its high accuracy rating. We found that the RBF kernel showed the best classification performance for 44.64% of the data sets. This leaves two issues for the RBF kernel: first, should we search for the best width between 0.2 and 1.2, and second, how can we estimate the best RBF kernel width? We will examine two different RBF width estimation methods (i.e., the maximum likelihood (ML) and Nelder-Mead (N-M) simplex methods), present comparative performance results, and then attempt to gain insight into which method should be used for certain datasets.

Figure 1. A pictorial view of the rbf kernel performance on an artificial dataset with different widths. The cross and rectangular signs indicate the two classes of data. The middle continuous lines (for width 1) of the graphs represent the optimal hyperplane for classification.

(a) rbf width = 0.2: 3 classification errors
(b) rbf width = 0.4: 1 classification error
(c) rbf width = 0.6: 0 classification errors
(d) rbf width = 0.8: 0 classification errors
(e) rbf width = 1: 0 classification errors
(f) rbf width = 1.2: 0 classification errors


OPTIMUM RBF KERNEL WIDTH 
SELECTION 

We consider the ML and N-M simplex methods to estimate the best RBF kernel width. The ML method estimates the variance of the data set, and then the normalized variance is considered as the width of the RBF kernel. The N-M simplex method searches for the appropriate variance from the transformed data and then selects this value as the best width for the RBF kernel. This unconstrained optimization process is faster than some constrained optimization methods. In the following section, we will explain both methods with performance evaluation results and then generate rules to determine which method is best suited to which classification problem. We summarize both of these methods from Ross (2000) and Lagarias et al. (1998); they are well studied in the statistical community but, to the best of our knowledge, have not been applied to RBF kernel parameter estimation before.

Table 2. Confusion matrix based on 10FCV results for the RBF kernel selection rule

Data condition satisfied    rbf kernel best: Yes (Y)    rbf kernel best: No (N)
Y                           2.9                         0.7
N                           0.7                         6.7

Accuracy = 87.27%
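
The accuracy reported with Table 2 can be checked directly from the confusion matrix entries: it is the sum of the diagonal cells divided by the sum of all cells. A short Python check, using only the four values in the table:

```python
# Confusion matrix entries from Table 2 (rows: data condition satisfied Y/N,
# columns: rbf kernel best Y/N).
matrix = [[2.9, 0.7],
          [0.7, 6.7]]

correct = matrix[0][0] + matrix[1][1]
total = sum(sum(row) for row in matrix)
accuracy = 100.0 * correct / total
print(f"accuracy = {accuracy:.2f}%")   # (2.9 + 6.7) / 11.0
```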

Maximum Likelihood Method 

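The maximum likelihood method has been a very popular parameter estimation method in the statistical community for many years. Let us consider that $x_1, \ldots, x_n$ are independent and normally distributed with unknown mean (μ) and standard deviation (σ) (Ross, 2000). Now, the density function is as follows:

$$f(x_1, \ldots, x_n \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right] = \left(\frac{1}{2\pi}\right)^{n/2} \frac{1}{\sigma^n} \exp\left[-\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2}\right] \qquad (6)$$

The logarithm of the likelihood function is given by:

$$\log f(x_1, \ldots, x_n \mid \mu, \sigma) = -\frac{n}{2} \log(2\pi) - n \log \sigma - \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2} \qquad (7)$$

Now, after differentiating with respect to μ and σ, we can write

$$\frac{d}{d\mu} \log f(x_1, \ldots, x_n \mid \mu, \sigma) = \frac{\sum_{i=1}^{n} (x_i - \mu)}{\sigma^2} \qquad (8)$$

$$\frac{d}{d\sigma} \log f(x_1, \ldots, x_n \mid \mu, \sigma) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{\sigma^3} \qquad (9)$$

By equating these previous two equations to zero, we find that the maximum likelihood is obtained when the width σ of the RBF kernel is

$$\hat{\sigma} = \left[\frac{\sum_{i=1}^{n} (x_i - \hat{\mu})^2}{n}\right]^{1/2}, \quad \text{where } \hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (10)$$
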

Now, let us consider the calculated σ value as the smoothing parameter h of the RBF kernel. Since σ and the RBF kernel parameter h both serve as variance measures, we approximate h using equation (10).

The effectiveness of σ (= 0.1) from maximum likelihood estimation on the wdbc dataset is shown in Figure 2; this value made the non-normally distributed wdbc dataset a normally distributed dataset.
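
Equation (10) amounts to taking the maximum likelihood estimate of the standard deviation (dividing by n rather than n - 1). The following Python sketch computes it for a single attribute; the function name `ml_width` and the sample values are hypothetical, and how the estimate is normalized or combined across several attributes is left open here.

```python
import math


def ml_width(values):
    """Maximum likelihood estimate of sigma: sqrt(sum((x_i - mean)^2) / n)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


sample = [0.2, 0.4, 0.4, 0.6, 0.9]   # made-up, already rescaled attribute values
h = ml_width(sample)
print(f"estimated width h = {h:.4f}")
```

Because the estimate needs only one pass over the data, no SVM model has to be trained to obtain a candidate width.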


Nelder-Mead Simplex Method 


In our previous study, we observed that the RBF kernel performs best if the data are normally distributed (Ali & Smith, 2003). We assume the data distribution is normal if the interquartile range of the data set is close to 1.3. The N-M simplex method is suitable for finding a parameter that transforms the data into a normal distribution. Reshaping the problem, one can find the best smoothing parameter h for the RBF kernel so that the data are effectively transformed.

Figure 2. The effect of σ (= 0.1) on the wdbc dataset. The suitable σ value makes the non-normally distributed wdbc dataset a normally distributed dataset.

The N-M simplex method for unconstrained optimization has been used extensively to solve parameter estimation problems for several decades. It is still the method of choice in statistics, engineering, and the physical and medical sciences, due to its ease of use. This method does not require derivatives and is often claimed to be robust for problems with discontinuous attribute values (Lagarias et al., 1998). First, we transform the data by following a Box-Cox transformation (Gentle, 2002) in order to produce data that follows a normal distribution more closely than the original data:

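$$\tau(X; \lambda) = \begin{cases} \dfrac{X^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log X & \text{if } \lambda = 0 \end{cases} \qquad (11)$$

This transformation can be used only for positive response variables. Box and Cox (Gentle, 2002) suggested the following transformation for variables with negative elements:

$$\tau(X; \lambda) = \begin{cases} \dfrac{(X + \delta)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(X + \delta) & \text{if } \lambda = 0 \end{cases} \qquad (12)$$

Now, our aim is to find the appropriate value of λ, which can be considered as similar to the width of the RBF kernel, since they are both measures of variance. We use the N-M simplex method to find the best value of λ.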
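
Equations (11) and (12) can be sketched in Python as follows. The function name `box_cox` and the shift argument `delta` (standing for δ, chosen so that every X + δ is positive) are illustrative assumptions, and the sample values are made up.

```python
import math


def box_cox(values, lam, delta=0.0):
    """Box-Cox transform of equations (11)-(12); requires x + delta > 0 for all x."""
    shifted = [x + delta for x in values]
    if any(s <= 0 for s in shifted):
        raise ValueError("all shifted values must be positive")
    if lam == 0:
        return [math.log(s) for s in shifted]
    return [(s ** lam - 1.0) / lam for s in shifted]


data = [1.0, 4.0, 9.0, 16.0]
print(box_cox(data, 0.5))                          # lambda = 0.5: 2 * (sqrt(x) - 1)
print(box_cox(data, 0.0))                          # lambda = 0: natural logarithm
print(box_cox([-2.0, 0.0, 7.0], 0.5, delta=3.0))   # the shift handles negative elements
```

With λ = 0.5 the transform compresses large values more than small ones, which is how a right-skewed attribute can be pulled closer to a symmetric shape.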

Each iteration of the N-M method begins with a simplex, specified by its n + 1 vertices and the associated function values. One or several test points are computed, together with their function values, and then the iteration terminates with a new simplex such that the function values at its vertices satisfy some form of descent condition compared with the previous simplex.

One iteration of the N-M simplex algorithm consists of the following steps:

1. Order: Order and relabel the n + 1 vertices as $x_1, \ldots, x_{n+1}$ such that $f(x_1) \leq \cdots \leq f(x_{n+1})$. Since we want to minimize f, we refer to $x_1$ as the best vertex or point, to $x_{n+1}$ as the worst point, and to $x_n$ as the next worst point. Let $\bar{x}$ refer to the centroid of the n best points of the simplex.

2. Reflect: Compute the reflection point $x_r = \bar{x} + \rho(\bar{x} - x_{n+1})$, where ρ is a parameter. Evaluate $f_r = f(x_r)$. If $f_1 \leq f_r < f_n$, accept the reflected point $x_r$ and terminate the iteration.

3. Expand: If $f_r < f_1$, compute the expansion point $x_e = \bar{x} + \chi(x_r - \bar{x})$, where χ is a parameter. If $f_e < f_r$, accept $x_e$ and terminate the iteration; otherwise (i.e., if $f_e \geq f_r$), accept $x_r$ and terminate the iteration.

4. Contract: If $f_r \geq f_n$, perform a contraction between $\bar{x}$ and the better of $x_{n+1}$ and $x_r$: $x_c = \bar{x} - \gamma(\bar{x} - x_{n+1})$, where γ is a parameter. If $f_c < f_{n+1}$, accept $x_c$ and terminate the iteration.

5. Shrink Simplex: Evaluate f at the n new vertices $v_i = x_1 + \sigma(x_i - x_1)$ for $i = 2, \ldots, n + 1$, where σ is the shrink parameter (not to be confused with the kernel width).

Now the best vertex is considered as the optimum value of λ and is also taken as the smoothing parameter h of the RBF kernel. For the four coefficients (ρ, χ, γ, and σ), the standard values reported in Lagarias et al. (1998) have been adopted.
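
The steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: it minimizes a user-supplied function, uses the standard coefficients ρ = 1, χ = 2, γ = 0.5, and σ = 0.5 from Lagarias et al. (1998), and distinguishes outside and inside contraction as in that reference. The function name `nelder_mead`, the stopping rule, and the test objectives are assumptions made for this example.

```python
def nelder_mead(f, simplex, rho=1.0, chi=2.0, gamma=0.5, sigma=0.5,
                max_iter=500, tol=1e-9):
    """Minimize f from a starting simplex of n + 1 vertices (lists of n floats)."""
    pts = [list(p) for p in simplex]
    n = len(pts) - 1
    for _ in range(max_iter):
        # 1. Order: best vertex first, worst vertex last.
        pts.sort(key=f)
        f_vals = [f(p) for p in pts]
        best, worst = pts[0], pts[-1]
        # Stop once every vertex is within tol of the best one.
        if max(abs(a - b) for p in pts[1:] for a, b in zip(p, best)) < tol:
            break
        # Centroid of the n best vertices.
        centroid = [sum(p[d] for p in pts[:-1]) / n for d in range(n)]

        def along(t):
            # Point on the line through the centroid, away from the worst vertex.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        # 2. Reflect.
        x_r = along(rho)
        f_r = f(x_r)
        if f_vals[0] <= f_r < f_vals[-2]:
            pts[-1] = x_r
            continue
        # 3. Expand.
        if f_r < f_vals[0]:
            x_e = along(rho * chi)
            pts[-1] = x_e if f(x_e) < f_r else x_r
            continue
        # 4. Contract: outside if x_r beats the worst vertex, otherwise inside.
        if f_r < f_vals[-1]:
            x_c = along(rho * gamma)
            accepted = f(x_c) <= f_r
        else:
            x_c = along(-gamma)
            accepted = f(x_c) < f_vals[-1]
        if accepted:
            pts[-1] = x_c
            continue
        # 5. Shrink all vertices except the best one toward the best vertex.
        pts = [best] + [[b + sigma * (v - b) for b, v in zip(best, p)]
                        for p in pts[1:]]
    return min(pts, key=f)


# One-dimensional stand-in objective with its minimum at 0.5.
result_1d = nelder_mead(lambda p: (p[0] - 0.5) ** 2, [[0.0], [1.0]])
# Two-dimensional quadratic with its minimum at (1, -2).
result_2d = nelder_mead(lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2,
                        [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(result_1d, result_2d)
```

In the setting of this chapter, f would measure how far the Box-Cox-transformed data are from normality as a function of λ; that objective is not spelled out here, so simple quadratics stand in for it.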

The effectiveness of λ with the Box-Cox transformation on the wine dataset is shown in Figure 3. The suitable value of λ is 0.5, which made the non-normally distributed wine dataset a normally distributed dataset.


RBF Width Estimation Algorithms 
Performance: Accuracy 

The average test set classification performance 
of RBF kernel with parameter 0.2-1. 2, RBF_best 
(best performance manually selected from width 
0.2-1.2), best width approximation by ML and 
NM methods is shown in Figure 4. The results 
are presented only for those 58 of the 112 original 
datasets that are suited to the RBF kernel (satisfy 
rule #1). 

TheRBFML andRBFN-Mmethods showed 
close performance with the best RBF accuracy 
found through exhaustive search of width range 
0.2 to 1.2. Both methods showed average higher 
accuracy than some individual RBF width per- 
formance. For large datasets (more than 1,000 
samples) RBF best showed average accuracy 
77.52%; RBF ML and RBF N-M methods showed 
75.94% and 71.41%. The RBF ML showed better 
performance than RBF_N-M method. Onthe other 
hand for small datasets (less than 1,000 samples) 
RBF best showed average accuracy 69.76%, and 
RBF N-M methods showed 67.26% and 64.98%. 


Figure 3. The effect of the Box-Cox transformation on the wine dataset ($\lambda = 0.5$). The suitable $\lambda$ value makes the nonnormally distributed wine dataset a normally distributed dataset.

Figure 4. Average test set accuracy for different RBF kernel parameter fitting methods for problems satisfying rule #1 (58 datasets).

Figure 5. Average test set accuracy for different RBF kernel parameter fitting methods for problems not satisfying rule #1 (54 datasets).


The RBF_ML method again performed better than the RBF_N-M method for small datasets. The RBF_ML method predicted the best width for the RBF kernel for 29.31% of the datasets where the RBF kernel is expected to be best. The RBF_N-M method, on the other hand, predicted the best width for 31.03% of the datasets. We observed that 24.13% of the datasets have their best width outside the range of 0.2 to 1.2. For many of the 112 problems, the RBF_ML and RBF_N-M methods predicted the same RBF width. The performance of the RBF kernel on datasets better suited to other (non-RBF) kernels is shown in Figure 5.

RBF Width Estimation Algorithms 
Performance: Computational Time 

The computational performance of the three methods for determining the best RBF width, namely RBF_best (exhaustive search of widths 0.2-1.2), estimation by RBF_ML, and estimation by RBF_N-M, is shown in Table 3.

Table 3. Average computational performance of the different RBF kernel width estimation methods.

                                      rbf_best    rbf_ML    rbf_N-M
Average Computational Time in Sec.    1269.39     0.5336    5.1350


The exhaustive best width search needed far more computational time than the RBF_ML and RBF_N-M methods, because it selects the RBF widths one by one from the range 0.2-1.2 and trains an SVM RBF model for each of them. In contrast, the RBF_ML and RBF_N-M methods estimate the best RBF width for SVM by simply evaluating equation (10) and by a simple iteration of equation (12), respectively; both estimate the likely performance of the SVM model without the need to build such models. Therefore, the RBF_ML and RBF_N-M methods show superior computational performance compared to the exhaustive search method.

Significance Test 

The t-test results for the different RBF kernel width estimation methods are summarized in Table 4. We took RBF_best as the base method. The test input was the percentage of correct classification for all width estimation methods.

An output of H = 0 in Table 4 indicates that we may not reject the null hypothesis that the two compared methods perform equally. Alternatively, H = 1 means we may reject the null hypothesis. RBF_best compared with RBF widths 1 and 1.2 showed a significant performance difference: the low significance values (below 0.05) suggest rejecting the null hypothesis, and the 95% confidence intervals for these comparisons exclude zero, as shown in Table 4. In contrast, RBF_best compared with RBF widths 0.2-0.8 and with the RBF_ML and RBF_N-M methods showed no significant performance difference: the higher significance values suggest retaining the null hypothesis, and the 95% confidence intervals for these comparisons contain zero, as shown in Table 4. The RBF_ML and RBF_N-M methods thus give results comparable to exhaustive search but are much faster.
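To make the roles of H, the significance value, and the confidence interval concrete, here is a minimal Python sketch of a paired comparison between a base method and an alternative. It is only an approximation under stated assumptions: the text does not say whether the test was paired, the sign convention (alternative minus base) is a guess, and the sketch replaces the t distribution by a normal approximation so that it runs on the standard library alone. The accuracy lists are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_comparison(acc_base, acc_other, alpha=0.05):
    """Return (H, significance, confidence interval) for paired accuracies.

    Assumption: normal approximation to the t distribution.
    """
    diffs = [o - b for b, o in zip(acc_base, acc_other)]
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    if se == 0:
        raise ValueError("all paired differences are identical")
    z = d_bar / se
    normal = NormalDist()
    significance = 2.0 * (1.0 - normal.cdf(abs(z)))
    half_width = normal.inv_cdf(1.0 - alpha / 2.0) * se
    interval = (d_bar - half_width, d_bar + half_width)
    h = 1 if significance < alpha else 0
    return h, significance, interval


# Invented accuracies (percent correct) on six datasets.
base = [70.0, 72.0, 68.0, 75.0, 71.0, 69.0]
similar = [71.0, 71.0, 68.0, 76.0, 70.0, 69.0]
worse = [60.0, 63.0, 59.0, 64.0, 61.0, 58.0]
print(paired_comparison(base, similar))
print(paired_comparison(base, worse))
```

With these numbers the first comparison yields H = 0 with an interval that straddles zero, and the second yields H = 1 with an interval entirely below zero, mirroring the two patterns visible in Table 4.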

The average classification performance and the significance tests have shown that classification accuracy depends on the particular RBF kernel width selected. Detailed best width estimation results for the RBF_ML and RBF_N-M methods are presented in Appendix C. We observe from both estimation methods that the best width can lie outside the range 0.2-1.2. No single method is always best at estimating the RBF width for all problems. We therefore need a way to provide a priori information about which best width estimation method is suitable for which classification problem with SVM.


Table 4. Results of the t-test for all methods of RBF width selection.

Algorithms              Hypothesis H    Significance S    Confidence Interval CI
rbf_best vs rbf 0.2     0               0.4539            (-7.6381, 3.4257)
rbf_best vs rbf 0.4     0               0.4368            (-7.8229, 3.3904)
rbf_best vs rbf 0.6     0               0.2093            (-9.4083, 2.0726)
rbf_best vs rbf 0.8     0               0.0732            (-11.1674, 0.5058)
rbf_best vs rbf 1       1               0.0346            (-12.2829, -0.4664)
rbf_best vs rbf 1.2     1               0.0171            (-13.1628, -1.3012)
rbf_best vs rbf_ML      0               0.4064            (-7.8649, 3.1961)
rbf_best vs rbf_NM      0               0.0746            (-11.2024, 0.5342)
In the following section, we describe the
methodology we use to assist in the appropriate 
selection of the best width estimation method for 
a given dataset. First, each dataset is described by 
a set of measurable meta characteristics; we then 
combine this information with the performance 
results; and finally, use a rule-based induction 
method to provide rules describing when each 
best parameter estimation method for RBF kernel 
is likely to perform well. 

Dataset Characteristics 
Measurement 

Each dataset can be described by simple, distance-based, and distribution-based statistical measures (Smith et al., 2001, 2002). These three sets of measures characterize the datasets in different ways. First, the simple classical statistical measures identify the data characteristics based on variable-to-variable comparisons. Then, the distance-based measures identify the data characteristics based on sample-to-sample comparisons. Finally, the distribution-based measures consider the single data points of the data matrix to identify the dataset characteristics. We average most statistical measures over all the variables and take these averages as global measures of the dataset characteristics.

Simple Statistical Measures 

Descriptive statistics can be used to summarize any large dataset into a few numbers that contain most of the relevant characteristics of that dataset. The following table lists the statistical measures used in this work, as provided by the Matlab Statistics Toolbox and some other sources (Mendenhall & Sincich, 1995; Tamhane & Dunlop, 2000):


Meta Attribute Names        Meta Attribute Names
Geometric Mean              Max. and Min. Eigenvalue
Harmonic Mean               Skewness
Trimmed Mean                Kurtosis
Standard Deviation          Correlation Coefficient
Interquartile Range         Percentile


Distance-Based Measures 

Distance-based measures calculate the dissimilarity between samples. We measure the Euclidean, city block, and Mahalanobis distance between each pair of observations for each dataset as follows:

Meta Attribute Names
Euclidean Distance
City Block Distance
Mahalanobis Distance
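The pairwise distance measures can be sketched in a few lines of Python. The sketch below covers only the Euclidean and city block distances; the Mahalanobis distance is left out because it additionally needs the inverse covariance matrix of the dataset. Averaging the pairwise distances into one number per dataset is an assumption about how a single meta attribute might be obtained, and the function names and sample points are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available since Python 3.8


def city_block(a, b):
    """City block (Manhattan) distance between two observations."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_pairwise(samples, metric):
    """Average a distance metric over every pair of observations."""
    pairs = list(combinations(samples, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Three made-up two-dimensional observations.
samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(mean_pairwise(samples, dist))
print(mean_pairwise(samples, city_block))
```

The number of pairs grows quadratically with the number of observations, so for large datasets a random subsample of pairs may be a practical substitute.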


Distribution-Based Measures 

The probability distribution of a random variable 
describes how the probabilities are distributed 
over the various values that the random variable 
can take. We measure the probability density 
function (PDF) and cumulative distribution 
function (CDF) for all datasets by considering 
different types of distributions as follows: 


Meta Attribute Names        Meta Attribute Names
Chi-Square PDF              Chi-Square CDF
Normal PDF                  Normal CDF
Binomial PDF                Discrete Uniform CDF
Exponential PDF             F PDF
Gamma PDF                   Hypergeometric CDF
Lognormal PDF               Poisson PDF
Rayleigh PDF                Student's t PDF


These measures are calculated for each of 
the datasets to produce a dataset characteristics 
matrix. Finally, by combining this matrix with 
the performance results in Appendix C, we can 
derive rules to suggest when certain best width 
estimation methods are appropriate. 

Rule Generation 

The trial-and-error approach is a very common procedure for selecting the best width for the RBF kernel, but finding the best width in this way is computationally expensive. If we are interested in applying a specific method to a particular problem, we have to consider which method is more suitable for which problem. This suitability can be assessed with rules developed from the data characteristics.

Rule-based learning algorithms, especially decision trees (also called classification trees or hierarchical classifiers), follow a divide-and-conquer or top-down induction approach that has been studied with interest in the machine learning community. Quinlan (1993) introduced the C4.5 and C5.0 algorithms to solve classification problems. C5.0 works in three main steps. First, the root node at the top of the tree considers all samples and passes them through to the second type of node, called a branch node. The branch node generates rules for a group of samples based on an entropy measure. In this stage, C5.0 constructs a very large tree by considering all attribute values and finalizes the decision rule by pruning. It uses a heuristic approach for pruning based on the statistical significance of splits. After fixing the best rule, the branch nodes send the final class value to the last node, called the leaf node (Duin, 1996; Quinlan, 1993). C5.0 has two parameters: the first is the pruning confidence factor (c), and the second is the minimum number of branches at each split (m). The pruning factor affects the error estimation and, hence, the severity of pruning of the decision tree. A smaller value of c produces more pruning of the generated tree, and a higher value results in less pruning. The minimum number of branches m indicates the degree to which the initial tree can fit the data. Every branch point in the tree should contain at least two branches (so the minimum value is m = 2). For detailed formulations, see Quinlan (1993).
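The entropy measure that drives the branch nodes can be illustrated with a small Python sketch. It computes the plain information gain of a binary threshold split; this is a simplification for illustration and is not claimed to be the exact criterion implemented in C5.0. The attribute values, class labels, and function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Made-up meta attribute values and the best method for each dataset.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["RBF_ML", "RBF_ML", "RBF_N-M", "RBF_N-M"]
print(information_gain(values, labels, 2.5))  # perfect split
print(information_gain(values, labels, 1.5))  # weaker split
```

A tree learner of this family would evaluate many candidate thresholds per attribute and keep the one with the largest gain before recursing on the two subsets.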

Now that the characteristics of each dataset can be quantitatively measured, we can combine this information with the empirical evaluation of RBF width estimation performance and construct the dataset characteristics matrix. Thus, the result of the jth width selection method on the ith dataset is calculated as:

$$R_{ij} = 1 - \frac{e_{ij} - \max(e_i)}{\min(e_i) - \max(e_i)} \qquad (13)$$

where $e_{ij}$ is the percentage of correct classification for the jth method on dataset i, and $e_i$ is the vector of accuracies for dataset i. The class values in the matrix are assigned based on the best performance rank. The best rank is defined as 1, and the worst is 0. For example, if the RBF_ML method shows the ranking performance 1 for dataset A, then the class in the matrix for problem A is RBF_ML. Based on the 112 classification problems, we can then train a rule-based classifier (C5.0) to learn the relationship between dataset characteristics and width selection method performance. We use 90% of the matrix to construct the model tree. The process is then repeated using a 10-fold cross validation approach, so that 10 trees are constructed. From these 10 trees, the best rules are found for each best width selection method based on the best test set results. The generalization of these rules is then tested by applying each of the randomly extracted test sets and calculating the average accuracy of the rules, as discussed below in Tables 5 and 6. By tuning, we found suitable parameter values: a global pruning factor c of 70-90% and a minimum number of branches m of 2.
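The normalization in equation (13) and the resulting class assignment can be sketched as follows. The sketch reads the equation as mapping the best accuracy on a dataset to 1 and the worst to 0, which matches the rank convention stated in the text; the treatment of ties (all methods equally accurate) and the function names are assumptions made for illustration, and the accuracies are invented.

```python
def normalized_performance(accuracies):
    """Scale the accuracies of all methods on one dataset to [0, 1].

    The best method receives 1 and the worst receives 0.
    """
    best, worst = max(accuracies), min(accuracies)
    if best == worst:
        # Assumption: when every method ties, treat all of them as best.
        return [1.0 for _ in accuracies]
    return [1.0 - (e - best) / (worst - best) for e in accuracies]


def best_method(accuracy_by_method):
    """Name of the method whose normalized performance equals 1."""
    names = list(accuracy_by_method)
    scores = normalized_performance([accuracy_by_method[n] for n in names])
    return names[scores.index(max(scores))]


# Invented accuracies (percent correct) for one dataset.
dataset_a = {"RBF_ML": 80.0, "RBF_N-M": 70.0, "RBF_0.6": 75.0}
print(normalized_performance(list(dataset_a.values())))
print(best_method(dataset_a))
```

Applying `best_method` to every dataset would give the class column that is appended to the dataset characteristics matrix before the rule learner is trained.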

We presented the rule for the RBF kernel in the second section. If a dataset satisfies the RBF kernel rule, we then need to find the best RBF width estimation method. In the following section, we therefore generate the rules describing when to choose the RBF_ML and RBF_N-M methods for best RBF width estimation.

Rules for RBF_ML Method

The best rules for the RBF_ML method are generated with c = 85% and m = 2 as follows:

Rule # 2. IF (md_rs <= 4.3713 and y_bio_pdf > 2.7036) or (mean > 49.0052 and s > -1.7911, and y ,, <= 0.00030981) or (p_chi_cdf <= 30.3236), THEN we should choose the RBF_ML method for RBF kernel width estimation.

Table 5. Confusion matrix based on 10FCV results for the RBF_ML method selection rule

                              RBF_ML Method Best
Data Condition Satisfied      Y        N
Y                             2.7      0.2
N                             0.3      2.6

Accuracy = 91.38%

Rules for RBF_N-M Method

The best rules for the RBF_N-M method are generated with c = 80% and m = 2 as follows:

Rule # 3. IF (s > -0.4284 and y_norm_pdf <= 0.23362 and y_ray_pdf <= 2.3497) or (y_bio_pdf > 2.7036) THEN we should choose the RBF_N-M method for RBF kernel width estimation.

Table 6. Confusion matrix based on 10FCV results for the RBF_N-M method selection rule

                              RBF_N-M Method Best
Data Condition Satisfied      Y        N
Y                             2.9      0.2
N                             0.4      2.3

Accuracy = 89.66%

The generated rules show around 90% accuracy. On average, we observed that the RBF_ML approximation method performed slightly better than the RBF_N-M approximation. However, which method was best for an individual dataset proved to be quite data dependent. These rules may be useful for determining which RBF width approximation method is most appropriate for which problem.

CONCLUSION 

In this research, we have empirically investigated how to select the RBF kernel and its best width for SVM. We proposed a simple rule for the RBF kernel, based on dataset information. This method is faster than trial-and-error-based kernel selection with SVM. We observed that the suitable RBF width can lie outside the range 0.2 to 1.2, which is the range commonly tested in the literature. The larger estimated widths increased the accuracy of the kernel in some specific cases. The RBF_ML and RBF_N-M methods estimate the best RBF width very quickly. We evaluated the generated rules using 10-fold cross validation, and all of them showed high accuracy (around 90%). The main benefit of our meta-learning-based methodology is that we can achieve higher accuracy for some classification problems and significant savings in time, with at least similar accuracy, for all problems.

ABSTRACT

In the life sciences, predicting the folding structure of a protein plays an important role in investigating its function and discovering new drugs. Protein folding recognition can be naturally cast in the form of a multicategory classification problem that appears challenging due to the high number of fold classes. Thus, in the last decade, several supervised learning methods have been applied in order to discriminate between proteins characterized by different folds. Recently, discrete support vector machines have been introduced as an effective alternative to traditional support vector machines. Discrete SVM have been shown to outperform other competing classification techniques both on binary and multicategory benchmark datasets. In this chapter, we adopt discrete SVM for protein folding classification. Computational tests performed on benchmark datasets empirically support the effectiveness of discrete SVM, which are able to achieve the highest prediction accuracy.


INTRODUCTION 

Proteins are sequences of amino acids organized into three-dimensional structures that largely influence their function and evolution. The prediction of the three-dimensional structure of a protein is a challenging problem that has many applications in discovering new drugs and therapies, and for which different approaches have been proposed. The early efforts were aimed at predicting the function based on sequence similarity comparison (Holm & Sander, 1999). However, it has been observed that this approach may fail, since proteins with similar functions can sometimes differ substantially in terms of primary sequence structure.

The alternative taxonometric approach relies 
on predicting the protein fold, which can be de- 
fined as a common three-dimensional pattern with 
the same major secondary structure elements in 
the same arrangements and with the same topo- 
logical connections (Craven, Mural, Hauser, & 
Uberbacher, 1995). Hence, the problem can be 
cast in the form of a multicategory classification 
task in the context of learning from data, where 
one has to determine an explanatory relationship 
between protein folding and the underlying pri- 
mary structure (Baldi & Brunak, 1998; Durbin, 
Eddy, Krogh, & Mitchison, 1998). 

Notice that multicategory classification has 
proven to be a much more complex task in com- 
parison to its binary counterpart, particularly 
for protein folding recognition, due to the large 
number of distinct folds, which can be in the 
order of hundreds. For this reason, algorithms 
for protein folding classification are usually 
characterized by a low degree of prediction ac- 
curacy and require a high computational effort. 
This complexity has been partially mitigated by 
confining the attention only to the most populated 
classes of folds. Sometimes, folds are grouped 
into four structural classes that correspond to a 
higher concept, with respect to the folds, in the 
hierarchical representation of proteins. Hence, the 
protein folding recognition appears much harder 
to accomplish than the prediction of the structural 
class of a protein. 

In general, we can divide multicategory classifiers into two main groups. Some techniques directly address the multicategory nature of the classification task, such as classification trees (Breiman, Friedman, Olshen, & Stone, 1984; Murthy, 1998; Quinlan, 1993). Other methods, which appear more effective, are based on a sequence of binary classification problems. Referring to protein folding, the classification of a new protein is then performed by assigning the fold that is closest to the predictions of the binary classifiers according to a suitable metric. To some extent, the approaches developed are not tied to a specific binary classifier. A common scheme for deriving a multicategory classifier is based on the one-against-all framework, in which binary classification problems are obtained with the aim of discriminating between examples of one class and all the remaining examples (Dietterich & Bakiri, 1995; Guruswami & Sahai, 1999). If $D$ denotes the number of different folds, in this scheme one has to train only $D$ binary classifiers, although each of them is rather complex due to the heterogeneity of the class collecting the remaining folds. A different scheme, termed pairwise decomposition or round-robin, has been devised by letting the binary problems discriminate among all pairs of folds (Kreßel, 1999). In this case, $D(D-1)/2$ binary classifiers have to be trained, where each of them is trained on a small number of proteins belonging to one of two homogeneous fold classes. Finally, other hybrid schemes can be devised, such as the unique one-against-all scheme proposed in Ding and Dubchak (2001), where a cascading combination of the two previous schemes is adopted.
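The difference in the number of binary problems between the two decomposition schemes can be made concrete with a short Python sketch. It simply enumerates, for a list of class labels, which classes play the positive and negative role in each binary problem; the function names are hypothetical, and the 27 labels stand for the 27 folds of the benchmark used later in the chapter.

```python
from itertools import combinations


def one_against_all(classes):
    """One binary problem per class: that class versus all the others."""
    return [({c}, set(classes) - {c}) for c in classes]


def round_robin(classes):
    """One binary problem per unordered pair of classes."""
    return [({a}, {b}) for a, b in combinations(classes, 2)]


folds = list(range(1, 28))  # 27 fold labels
print(len(one_against_all(folds)))  # D binary problems
print(len(round_robin(folds)))      # D(D-1)/2 binary problems
```

For $D = 27$ the one-against-all scheme needs 27 classifiers trained on all examples, while the round-robin scheme needs 351 classifiers, each trained on the examples of only two folds.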

Discrete support vector machines, originally 
introduced in Orsenigo and Vercellis (2003, 2004), 
are a successful alternative to SVM that is based 
on the idea of accurately evaluating the number 
of misclassified examples instead of measuring 
their distance from the separating hyperplane. 
Starting from the original formulation, discrete 
SVM have been effectively extended in sev- 
eral directions, to deal with multiclass problems 
(Orsenigo & Vercellis, 2007a) or to learn from a 
small number of training examples (Orsenigo & 
Vercellis, 2007b). 

In this chapter, we perform protein folding classification by means of discrete SVM. In order to assess the usefulness of the proposed method, some computational tests have been performed on benchmark datasets composed of proteins grouped into 27 different folds. The performance exhibited by the proposed method appears superior to that achieved by the best alternative classification techniques.

The chapter is organized as follows. In the 
next section, we provide a description of SVM 
and discrete SVM. Then, the different schemes 
for multicategory classification, based on discrete 
SVM, are introduced in the subsequent section. 
Finally, computational tests are illustrated in the 
last section. 


SVM AND DISCRETE SVM 

In a classification problem, we are required to discriminate among examples belonging to different classes. Formally, a training set $S_m$ composed of $m$ examples $(x_i, y_i)$, $i \in M = \{1, 2, \ldots, m\}$, in the $(n+1)$-dimensional real space $\mathbb{R}^{n+1}$ is given, where $x_i \in \mathbb{R}^n$ is a vector of attributes or features and $y_i$ is a scalar representing the label or class of $x_i$. Let $\mathcal{D} = \{1, 2, \ldots, D\}$ be the set of distinct class values. Each component $x_{ij}$ of an example $x_i$ is assumed to be a realization of a random variable $A_j$, $j \in N = \{1, 2, \ldots, n\}$, representing the $j$-th attribute of $S_m$.

Let $\mathcal{H}$ denote a set of functions $f(x): \mathbb{R}^n \to \mathcal{D}$ that represent hypothetical relationships between $x_i$ and $y_i$. A classification problem consists of defining an appropriate hypotheses space $\mathcal{H}$ and a function $f^* \in \mathcal{H}$ that optimally describes the relationship between the examples and their class values. A function $f^* \in \mathcal{H}$ can be considered optimal according to different criteria that should take into account the minimization of the empirical classification error on the training set and the maximization of the generalization capability on new data.

To assess the accuracy of $f \in \mathcal{H}$, the whole set of examples is usually partitioned into two disjoint subsets, denoted respectively as training and test set. For a given classifier, the discriminant function is learned using the examples from the training set, and then applied to predict the class of the examples in the test set. In the remainder of this section, we will refer to binary classification problems for which the class attribute $y_i$ takes only two different values, which may be labeled as $-1$ and $1$ without loss of generality.

Most binary classifiers actually generate as output a function $g: \mathbb{R}^n \to \mathbb{R}$, termed score function or margin, whose sign discriminates between the two classes, so that $f(x) = \mathrm{sgn}(g(x))$. Moreover, for these binary margin classifiers, the magnitude of the score $g(x)$ can be viewed as a measure of confidence in the class assignment, therefore expressing the likelihood that the example $x$ belongs to the predicted class $f(x) = \mathrm{sgn}(g(x))$.

Linear separating score functions have been widely used for binary classification, often as a building block within more complex schemes for pattern recognition. In this case, $g(x) = w'x - b$ is a hyperplane and $f(x) = \mathrm{sgn}(w'x - b)$. If the training examples are linearly separable, it is possible to find a pair $(w, b)$ such that $f(x_i) = \mathrm{sgn}(w'x_i - b) = y_i$, $i \in M$, by solving a linear optimization problem. Conversely, when the examples are not linearly separable, for every choice of the parameters $(w, b)$ there exists a subset $\tilde{M} \subseteq M$ such that $f(x_i) = \mathrm{sgn}(w'x_i - b) \neq y_i$, $i \in \tilde{M}$. Hence, in order to derive the optimal separating hyperplane, one is led to define a suitable loss function whose expectation with respect to the unknown distribution of the examples has to be minimized.
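A linear score classifier of this kind takes only a few lines of Python. The sketch below evaluates the score $w'x - b$, takes its sign as the predicted class, and lists the indices of the training examples that a given pair $(w, b)$ misclassifies. Assigning a zero score to class $-1$ is an arbitrary convention chosen here, and the function names and example points are hypothetical.

```python
def score(w, b, x):
    """Linear score g(x) = w'x - b."""
    return sum(wj * xj for wj, xj in zip(w, x)) - b


def classify(w, b, x):
    """Predicted class sgn(g(x)); a zero score is mapped to -1 by convention."""
    return 1 if score(w, b, x) > 0 else -1


def misclassified(w, b, examples):
    """Indices i for which sgn(w'x_i - b) differs from y_i."""
    return [i for i, (x, y) in enumerate(examples) if classify(w, b, x) != y]


# Made-up examples (x_i, y_i) with labels in {-1, 1}.
separable = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((3.0, 1.0), 1)]
w, b = (1.0, -1.0), 0.0
print(misclassified(w, b, separable))                       # no errors
print(misclassified(w, b, separable + [((1.0, 3.0), 1)]))   # one error
```

When the returned list is empty for some $(w, b)$, the examples are linearly separable by that hyperplane; otherwise the list plays the role of the subset of misclassified indices discussed in the text.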

The structural risk minimization (SRM) 
principle, developed in the context of statistical 
learning theory (Vapnik, 1995, 1998), establishes 
the concept of reducing the empirical classifica- 
tion error as well as the generalization error in 
order to achieve a higher accuracy on unseen 
data. Formally, this is obtained by minimizing 
the following risk functional: 

$$R(f) = \frac{1}{m} \sum_{i \in M} V(y_i, f(x_i)) + \lambda \|f\|_K^2 \qquad (1)$$

where the first term, based on the loss function $V$, is termed empirical risk, and represents the empirical error on the training set $S_m$, whereas the second term is related to the generalization capability of $f$. Here, $K = K(\cdot, \cdot)$ is a given symmetric positive definite function named kernel, $\|f\|_K$ denotes the norm of $f$ in the reproducing kernel Hilbert space induced by $K$, and $\lambda$ is a parameter that controls the trade-off between the two terms.

This leads to the minimization of the expression:

$$\frac{1}{m} \sum_{i \in M} V(y_i, f(x_i)) + \lambda \|w\|_2^2, \quad \text{where } \|w\|_2^2 = \sum_{j \in N} w_j^2. \qquad (2)$$

The second term in (2) is the reciprocal of the margin of separation, defined as the distance between the pair of parallel canonical supporting hyperplanes $w'x - b - 1 = 0$ and $w'x - b + 1 = 0$. The geometrical interpretation of the canonical hyperplanes and the margin is given in Figure 1.

According to the SRM principle, the first term in (1) and (2) expresses the misclassification rate of the examples in the training set. However, for computational reasons, SVM replace this term with a continuous proxy of the sum of the distances of the misclassified examples from the separating hyperplane.

More specifically, for SVM the loss function $V$ takes the form

$$V(y_i, f(x_i)) = |1 - y_i (w'x_i - b)|_+ , \qquad (3)$$

where $|t|_+ = t$ if $t$ is positive and zero otherwise.

Let $d_i$, $i \in M$, be a nonnegative slack variable such that the following linear constraints are satisfied:

$$y_i (w'x_i - b) \ge 1 - d_i, \quad i \in M. \qquad (4)$$

The optimal separating hyperplane can therefore be determined by solving the following quadratic optimization problem:

$$
\begin{aligned}
\text{(QSVM)} \quad \min \quad & \frac{1}{m} \sum_{i \in M} d_i + \lambda \|w\|_2^2 \\
\text{s. to} \quad & y_i (w'x_i - b) \ge 1 - d_i, \quad i \in M \\
& d_i \ge 0, \ i \in M; \quad w, b \ \text{free}, \qquad (5)
\end{aligned}
$$

where $\lambda$ is a parameter available to control the trade-off between the generalization capability of the classifier and the misclassification error.
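For a fixed hyperplane, the smallest slack satisfying constraint (4) is exactly the loss (3), so the objective can be evaluated without a solver. The Python sketch below does this under an explicit assumption about the objective, namely that it is the average slack plus $\lambda$ times the squared 2-norm of $w$; the scaling of the two terms in the original formulation may differ. Function names and data are hypothetical.

```python
def hinge_loss(w, b, x, y):
    """Loss (3): positive part of 1 - y (w'x - b)."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) - b)
    return max(0.0, 1.0 - margin)


def qsvm_objective(w, b, examples, lam):
    """Average slack plus lam * ||w||_2^2 (assumed form of the objective)."""
    slacks = [hinge_loss(w, b, x, y) for x, y in examples]
    return sum(slacks) / len(examples) + lam * sum(wj * wj for wj in w)


# Made-up examples (x_i, y_i) with labels in {-1, 1}.
examples = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((0.5, 0.0), 1)]
w, b = (1.0, 0.0), 0.0
print([hinge_loss(w, b, x, y) for x, y in examples])
print(qsvm_objective(w, b, examples, lam=0.1))
```

Note that the third example is on the correct side of the hyperplane but inside the margin, so it still contributes a slack of 0.5; this is the sense in which the hinge loss is a continuous proxy rather than an error count.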


Figure 1. Margin maximization for linearly nonseparable sets

The solution of the quadratic problem (QSVM) is obtained via Lagrangean duality, which also provides the interpretation of the support vectors (Vapnik, 1995). Taking advantage of the dual formulation and of suitable kernel functions (Cristianini & Shawe-Taylor, 2000; Schölkopf & Smola, 2002), SVM proceed by projecting the original examples into a higher dimensional feature space, in which the linear separation is derived, making it possible to efficiently obtain nonlinear discriminations in the original space.

A different family of classification models, termed discrete SVM, has been introduced in Orsenigo and Vercellis (2003, 2004) and is motivated by an alternative loss function that, according to the SRM principle, counts the number of misclassified examples instead of measuring their distance from the separating hyperplane. The distinctive trait of discrete SVM is the accurate representation of the empirical error, by using the total misclassification error in the objective function, in place of the sum of the slacks considered in (QSVM) by traditional SVM. Hence, the rationale behind discrete SVM is that a precise evaluation of the empirical error could possibly lead to a more accurate classifier.

In this case, the loss function is given by

$$V(y_i, f(x_i)) = c_i \, \theta(1 - y_i (w'x_i - b)), \qquad (6)$$

where $\theta(t) = 1$ if $t$ is positive and 0 otherwise, and $c_i$ is a penalty for the misclassification of the example $x_i$. In the absence of any domain-specific natural cost attribution, $c_i$ can be taken equal to the percentage of examples not belonging to the class $y_i$.
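The counting loss (6) and the default choice of the penalties can be sketched in Python as follows. The sketch expresses each penalty as a fraction in $[0, 1]$ rather than as a percentage, which is an arbitrary scaling choice; function names and data are hypothetical.

```python
def class_costs(labels):
    """Default penalty c_i: share of examples NOT belonging to the class y_i."""
    m = len(labels)
    return [sum(1 for other in labels if other != y) / m for y in labels]


def discrete_loss(w, b, x, y, c):
    """Loss (6): c * theta(1 - y (w'x - b)), with theta(t) = 1 only for t > 0."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) - b)
    return c if 1.0 - margin > 0 else 0.0


# Made-up examples (x_i, y_i) with labels in {-1, 1}.
examples = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((0.5, 0.0), 1)]
costs = class_costs([y for _, y in examples])
w, b = (1.0, 0.0), 0.0
losses = [discrete_loss(w, b, x, y, c) for (x, y), c in zip(examples, costs)]
print(costs)
print(losses)
```

With this default, an error on the rarer class costs more than an error on the more frequent class, and every example that violates its canonical hyperplane is charged its full penalty regardless of how far it lies from that hyperplane.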

The inclusion of the loss (6) into the risk 
functional (2) leads to the minimization of the 
following expression: 


$$\frac{1}{m} \sum_{i \in M} |y_i - f(x_i)| + \lambda \|w\|_2^2, \quad \text{where } \|w\|_2^2 = \sum_{j \in N} w_j^2. \qquad (7)$$


Notice that the first term in (7) precisely evaluates the empirical risk and expresses the accuracy of the classifier on the training set through the percentage of misclassified examples.

In order to formulate the optimization problem corresponding to the minimization of the risk (7), discrete SVM first replace the 2-norm in the second term of (7) with the 1-norm,

$$\|w\|_1 = \sum_{j \in N} |w_j|.$$

In order to express the empirical risk, we introduce the following binary variables,

$$p_i = \begin{cases} 0 & \text{if } x_i \text{ is correctly classified} \\ 1 & \text{if } x_i \text{ is misclassified} \end{cases} \qquad i \in M,$$

used to count the number of classification errors.

The complexity of the discriminating rule is related to the number of attributes that directly contribute, with a nonzero coefficient, to the separating hyperplane. Reducing the complexity of the discriminating rules is likely to increase the generalization capability of a model learned on the training set. Furthermore, the induced rules become simpler and easier for domain experts to interpret. Hence, in order to take into account the complexity of the rule, discrete SVM introduce a second set of binary variables, defined as:

$$q_j = \begin{cases} 0 & \text{if } w_j = 0 \\ 1 & \text{if } w_j \neq 0 \end{cases} \qquad j \in N.$$

Let $h_j$, $j \in N$, be the penalty cost for using attribute $j$. Discrete SVM determine the optimal separating hyperplane by solving the following mixed-integer linear optimization problem:

$$
\begin{aligned}
\text{(DSVM)} \quad \min \quad & \frac{\alpha}{m} \sum_{i \in M} c_i p_i + \beta \sum_{j \in N} u_j + \gamma \sum_{j \in N} h_j q_j \\
\text{s. to} \quad & y_i (w'x_i - b) \ge 1 - S p_i, \quad i \in M \qquad (11) \\
& -u_j \le w_j \le u_j, \quad j \in N \qquad (12) \\
& u_j \le R q_j, \quad j \in N \qquad (13) \\
& p_i \in \{0, 1\}, \ i \in M; \quad u_j \ge 0, \ q_j \in \{0, 1\}, \ j \in N; \quad w, b \ \text{free}
\end{aligned}
$$


The objective function of problem (DSVM) is composed of the weighted sum of three terms, expressing a trade-off between accuracy and potential of generalization, regulated by the parameters $(\alpha, \beta, \gamma)$. The first term represents the empirical error. The second term expresses the 1-norm computed with respect to a linear kernel, and its role is to restore the well-posedness of the optimization problem and to increase the predictive capability of the classifier. Finally, the third term is aimed at further increasing the generalization capability of the model by minimizing the number of attributes used in the classification rule. Constraints (11) are required in order to correctly evaluate the empirical error through the binary variables $p_i$, as each of them forces the binary variable $p_i$ to the value 1 whenever example $x_i$ is misclassified, that is, when it falls on the wrong side of the corresponding canonical hyperplane. Here $S$ is an appropriately large constant. Constraints (12) ensure that the components of $u$ bound the absolute value of the elements of the vector $w$, hence upper bounding its 1-norm $\|w\|_1$. Finally, constraints (13) imply that the binary variable $q_j$ takes the value 1 whenever $w_j \neq 0$, that is, whenever the $j$-th attribute is actively used in the optimal separating hyperplane. Like $S$, $R$ is a large constant.
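For a fixed pair $(w, b)$, the remaining variables of (DSVM) can be set to their cheapest feasible values by inspection, which gives a quick way to evaluate the three-term objective without a mixed-integer solver. The Python sketch below does this under the assumption that $S$ and $R$ are large enough never to be binding; it is an evaluation aid, not the solution procedure used by the authors. Function names, costs, and data are hypothetical.

```python
def dsvm_objective(w, b, examples, c, h, alpha, beta, gamma):
    """Evaluate the (DSVM) objective for a fixed hyperplane (w, b).

    Returns the objective value together with the implied p and q vectors.
    """
    m = len(examples)
    # Constraint (11): p_i may stay 0 only when y_i (w'x_i - b) >= 1.
    p = [
        0 if y * (sum(wj * xj for wj, xj in zip(w, x)) - b) >= 1 else 1
        for x, y in examples
    ]
    # Constraint (12): the smallest feasible u_j is |w_j|.
    u = [abs(wj) for wj in w]
    # Constraint (13): q_j must be 1 whenever u_j > 0.
    q = [1 if uj > 0 else 0 for uj in u]
    empirical = alpha / m * sum(ci * pi for ci, pi in zip(c, p))
    norm_term = beta * sum(u)
    attribute_term = gamma * sum(hj * qj for hj, qj in zip(h, q))
    return empirical + norm_term + attribute_term, p, q


# Made-up examples, unit misclassification costs and unit attribute costs.
examples = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((0.5, 0.0), 1)]
value, p, q = dsvm_objective(
    (1.0, 0.0), 0.0, examples,
    c=[1.0, 1.0, 1.0], h=[1.0, 1.0], alpha=1.0, beta=1.0, gamma=1.0,
)
print(value, p, q)
```

In this toy case the hyperplane uses only the first attribute, so the third term charges a single attribute cost, while two of the three examples violate their canonical hyperplane and are counted as errors.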

Model (DSVM) is a mixed binary linear op- 
timization problem, notoriously more difficult to 
solve to optimality than continuous linear optimi- 
zation. However, it can be solved by means of an 
efficient heuristic procedure, based on a sequence 
of linear optimization problems, for obtaining 
suboptimal solutions. Model (DSVM) can be 
used as a linear perceptron; alternatively, it can
be framed within a recursive procedure for the
generation of oblique classification trees, to derive
an optimal separating hyperplane at each node
of the tree, as in Orsenigo and Vercellis (2003, 
2004). In the quoted references, it was shown, by 
means of extensive testing, that the increase in 
model complexity is justified by a more accurate 
discrimination and a higher generalization capa- 
bility due to the correct estimation of the empirical 
misclassification error and the minimization of 
the number of attributes defining the separating 
hyperplane. 

Here is a short description of the heuristic procedure for determining a
feasible suboptimal solution to model (DSVM), based on a sequence of linear
optimization (LO) problems. The heuristic starts by considering the LO
relaxation of problem (DSVM). Each LO problem (DSVM)_{t+1} in the sequence is
obtained by fixing to zero the relaxed binary variable with the smallest
fractional value in the optimal solution of its predecessor (DSVM)_t. If
problem (DSVM)_t is feasible and its optimal solution is integer feasible, the
procedure stops, and the solution generated at iteration t is retained as an
approximation to the optimal solution of problem (DSVM). Otherwise, if problem
(DSVM)_{t+1} is infeasible, the procedure modifies the previous LO problem
(DSVM)_t by fixing to 1 all of its fractional variables. Problem (DSVM)_{t+1},
defined in this way, is feasible, and any of its optimal solutions is integer.
Thus, the procedure stops, and the solution found for (DSVM)_{t+1} is retained
as an approximation to the optimal solution of (DSVM).
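The control flow of this heuristic can be sketched in Python. This is only an illustrative sketch under stated assumptions: `solve_lo` is a hypothetical callback (it does not appear in the original description) that solves the LO relaxation with some binary variables fixed and returns the optimal variable values, or `None` when the problem is infeasible; fixings are assumed to accumulate from one problem of the sequence to the next; and the lookup-table solver at the end merely stands in for a real LO solver.

```python
from typing import Callable, Dict, Iterable, Optional

Solution = Dict[str, float]
Fixings = Dict[str, int]


def lo_rounding_heuristic(
    solve_lo: Callable[[Fixings], Optional[Solution]],
    binary_names: Iterable[str],
    tol: float = 1e-6,
) -> Solution:
    """Sequence-of-LO heuristic driven by a hypothetical solver callback."""
    names = list(binary_names)
    fixings: Fixings = {}
    solution = solve_lo(fixings)  # (DSVM)_0: the plain LO relaxation
    if solution is None:
        raise ValueError("the LO relaxation itself is infeasible")
    while True:
        fractional = [n for n in names if tol < solution[n] < 1.0 - tol]
        if not fractional:
            return solution  # integer feasible: stop and keep this solution
        # Build (DSVM)_{t+1}: fix to zero the smallest fractional variable.
        smallest = min(fractional, key=lambda n: solution[n])
        trial = {**fixings, smallest: 0}
        next_solution = solve_lo(trial)
        if next_solution is None:
            # Infeasible: go back to (DSVM)_t and fix its fractional variables to 1.
            repaired = {**fixings, **{n: 1 for n in fractional}}
            result = solve_lo(repaired)
            if result is None:
                raise RuntimeError("repaired problem unexpectedly infeasible")
            return result
        fixings, solution = trial, next_solution


def table_solver(table):
    """Toy stand-in for an LO solver: looks the answer up by the set of fixings."""
    return lambda fixings: table.get(frozenset(fixings.items()))


# Case 1: fixing p1 = 0 yields an integer solution.
case_1 = table_solver({
    frozenset(): {"p1": 0.3, "p2": 0.6},
    frozenset({("p1", 0)}): {"p1": 0.0, "p2": 1.0},
})
# Case 2: fixing p1 = 0 is infeasible, so both fractional variables are set to 1.
case_2 = table_solver({
    frozenset(): {"p1": 0.3, "p2": 0.6},
    frozenset({("p1", 1), ("p2", 1)}): {"p1": 1.0, "p2": 1.0},
})
result_1 = lo_rounding_heuristic(case_1, ["p1", "p2"])
result_2 = lo_rounding_heuristic(case_2, ["p1", "p2"])
```

Passing the solver as a callback keeps the sketch independent of any particular optimization library.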

DISCRETE SVM FOR 
MULTICATEGORY CLASSIFICATION 

In this section, we consider two techniques for extending discrete SVM to
multicategory classification problems. The first is obtained by framing model
(DSVM) within a one-against-all scheme, where the output of K binary
classifiers is combined to derive the desired multiclass discrimination. The
second method is based on a round-robin scheme, where K(K - 1)/2 binary
classification problems have to be solved.

Suppose we have a collection of binary classifiers B_k, k = 1, 2, ..., K, each
trained to discriminate two subsets of the classes. In what follows, we will
assume that each classifier B_k is based on model (DSVM). In order to specify
which combination of classes is presented to each classifier, a matrix
V ∈ {-1, 0, 1}^{D×K} is assigned, with the following interpretation: if either
v_dk = -1 or v_dk = 1, the examples for which y = d are presented to
classifier B_k with revised class value -1 or 1, respectively; if instead
v_dk = 0, the examples belonging to class d are not involved in the k-th
binary classification. Denote by g_k(x) the score function assigned by
classifier B_k to example x. Thus, given a new example x, the multicategory
classifier applies the K algorithms B_k, k = 1, 2, ..., K, obtaining a vector
of outputs g(x) = (g_1(x), g_2(x), ..., g_K(x)), and predicts for x the class
value d for which row v_d of the matrix V is “nearest” to g(x) with respect to
a suitably defined metric. More precisely, the multicategory classifier
requires that a distance δ: ℝ^K × {-1, 0, 1}^K → ℝ be defined, and assigns to
the new example x the class value y = argmin_d δ(g(x), v_d). A multicategory
classification scheme of this kind is called error-correcting output codes
(ECOC) (Allwein, Schapire, & Singer, 2000).
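The decoding rule y = argmin_d δ(g(x), v_d) can be illustrated with a minimal Python sketch. The function names, the toy coding matrix, the scores, and in particular the `disagreement` distance are assumptions made for illustration; the distances actually proposed for discrete SVM are the ones given in the text.

```python
from typing import Callable, Dict, Hashable, Sequence


def ecoc_predict(
    scores: Sequence[float],
    code_rows: Dict[Hashable, Sequence[int]],
    distance: Callable[[Sequence[float], Sequence[int]], float],
) -> Hashable:
    """Return the class d whose code row v_d minimizes distance(g(x), v_d)."""
    return min(code_rows, key=lambda d: distance(scores, code_rows[d]))


def disagreement(scores: Sequence[float], row: Sequence[int]) -> float:
    """Illustrative distance: classifiers whose score sign contradicts v_dk."""
    return sum(1 for g, v in zip(scores, row) if v != 0 and g * v <= 0)


# Toy setting: three classes, three classifiers (rows of V indexed by class).
V = {"a": (1, -1, -1), "b": (-1, 1, -1), "c": (-1, -1, 1)}
g = (-0.4, 0.9, -1.3)  # hypothetical scores g_1(x), g_2(x), g_3(x)
predicted = ecoc_predict(g, V, disagreement)  # row "b" agrees with every sign
```

Keeping the distance as a parameter mirrors the text: the coding matrix and the metric can be chosen independently of each other.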

The first specific method we consider within the ECOC scheme is the popular
one-against-all classifier, also termed one-vs.-others. In this case, K = D
and each classifier B_k is required to discriminate between a single class in
one subset and all the remaining classes in the other. Therefore, the D x D
matrix V is defined as: v_kk = 1 and v_dk = -1 for d ≠ k. As each classifier
is represented by model (DSVM), it is natural to define the distance δ as


$$\delta(g(x), v_d) = -\sum_{k=1}^{K} g_k(x)\,\max(v_{dk}, 0) \qquad (8)$$


The effect of this metric is to predict, for the new example x, the class
value y for which the score is maximum, that is,

$$y = \arg\max_k g_k(x) = \arg\max_k \,(w_k' x - b_k) \qquad (9)$$

Intuitively, this corresponds to assigning to the example x the class value y
that is most likely when at least one of the components of the vector g(x) is
positive, that is, g_k(x) > 0 for some k. If instead all the components of
g(x) are nonpositive, the choice of y in (9) corresponds to the least unlikely
class assignment. Geometrically, this is equivalent to labeling example x with
the class whose separating hyperplane lies furthest from x among the classes d
for which g_d(x) > 0; if instead g_k(x) < 0 for every k, then the assignment
of y picks the class for which the separating hyperplane is nearest to x. The
multicategory classifier derived by embedding model (DSVM) within the
one-against-all scheme described will be denoted as DSVM_OAA.
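A small Python sketch of the one-against-all rule (9) follows. It assumes that each binary classifier is summarized by a weight vector w_k and an offset b_k; the numerical values are invented for illustration and are not taken from the experiments.

```python
from typing import List, Sequence


def oaa_matrix(num_classes: int) -> List[List[int]]:
    """D x D coding matrix with v_kk = 1 and v_dk = -1 for d != k."""
    return [[1 if d == k else -1 for k in range(num_classes)]
            for d in range(num_classes)]


def oaa_predict(x: Sequence[float],
                weights: Sequence[Sequence[float]],
                offsets: Sequence[float]) -> int:
    """Index of the class with the largest score g_k(x) = w_k'x - b_k."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) - b
              for w, b in zip(weights, offsets)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical hyperplanes for three classes in two dimensions.
W = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
B = [0.5, 0.5, 0.5]
one_positive = oaa_predict((2.0, 0.0), W, B)   # scores 1.5, -0.5, -2.5
all_negative = oaa_predict((0.1, 0.2), W, B)   # every score is negative
```

In the second call no classifier claims the example, and the rule falls back to the least negative score, which is the "least unlikely" assignment discussed above.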

A second multicategory classifier that can be defined within the ECOC
framework is represented by a round-robin or pairwise decomposition scheme, in
which each class is discriminated against each other class in all
K = D(D - 1)/2 possible ways. The resulting matrix V is formed by all
D-dimensional columns that include exactly one -1 and one 1, with all the
remaining entries equal to 0. In this case, a more appropriate choice for δ is
the Hamming distance, also known as voting, defined as

$$\delta(g(x), v_d) = 0.5 \sum_{k=1}^{K} \bigl(1 - \operatorname{sgn}(g_k(x))\, v_{dk}\bigr) = 0.5 \sum_{k=1}^{K} \bigl(1 - f_k(x)\, v_{dk}\bigr) \qquad (10)$$

Using this distance is equivalent to assigning to example x the class value y
that received the most votes among all K = D(D - 1)/2 binary pairwise
classifiers applied to x. The multicategory classifier obtained by embedding
model (DSVM) within the round-robin scheme described will be denoted as
DSVM_RR.
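The equivalence between the Hamming distance (10) and vote counting can be checked on a toy case. The sketch below is illustrative only: it assumes that the column for the pair (i, j) holds +1 for class i and -1 for class j, so that a positive score is a vote for i, and the scores are invented. Exactly zero scores are not treated specially.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def rr_matrix(num_classes: int) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """Coding matrix with one column per unordered pair of classes."""
    pairs = list(combinations(range(num_classes), 2))
    matrix = [[1 if d == i else -1 if d == j else 0 for (i, j) in pairs]
              for d in range(num_classes)]
    return matrix, pairs


def sgn(value: float) -> int:
    return (value > 0) - (value < 0)


def hamming(scores: Sequence[float], row: Sequence[int]) -> float:
    """Distance (10): 0.5 * sum_k (1 - sgn(g_k(x)) * v_dk)."""
    return 0.5 * sum(1 - sgn(g) * v for g, v in zip(scores, row))


def rr_predict(scores: Sequence[float], matrix: List[List[int]]) -> int:
    return min(range(len(matrix)), key=lambda d: hamming(scores, matrix[d]))


def vote_counts(scores, pairs, num_classes: int) -> List[int]:
    votes = [0] * num_classes
    for g, (i, j) in zip(scores, pairs):
        votes[i if g > 0 else j] += 1
    return votes


matrix, pairs = rr_matrix(3)        # pairs: (0, 1), (0, 2), (1, 2)
g = (-0.5, 0.3, 0.8)                # hypothetical pairwise scores
winner = rr_predict(g, matrix)      # class at minimum Hamming distance
votes = vote_counts(g, pairs, 3)    # the same class collects the most votes
```

Every row of V contains the same number of zero entries, so the zeros add the same constant to each distance and do not affect the ranking of the classes.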

COMPUTATIONAL TESTS 

The prediction ability of discrete SVM has been evaluated on two benchmark
datasets that have been extensively used in the context of protein folding
recognition in order to compare alternative classification methods (Ding &
Dubchak, 2001; Dubchak, Muchnik, Holbrook, & Kim, 1995; Dubchak, Muchnik,
Mayor, Dralyuk, & Kim, 1999). These datasets do not contain highly homologous
protein sequences. In particular, the first dataset, collected by Dubchak et
al. (1999) and generally used for the training process, consists of 313
proteins having no more than 35% identity with each other. The second dataset,
utilized as an independent test sample, is the PDB-40D set developed by the
authors of the SCOP database (Andreeva, Howorth, Brenner, Hubbard, Chothia, &
Murzin, 2004; Murzin, Brenner, Hubbard, & Chothia, 1995). It contains 385
proteins with less than 40% sequence identity. Furthermore, all the proteins
in the testing sample have less than 35% identity with the proteins contained
in the training dataset. The two datasets refer to 27 folds, each represented
by at least seven proteins. These folds are grouped with respect to protein
structural classes, which can be one of four categories: all α, all β, α + β,
and α / β. Table 1 shows the folds included in each structural class, with the
number of proteins falling into the training and the testing datasets.


Table 1. Structural classes, folds, and cardinalities of the training and the test datasets

Structural class                                            Fold index   n° train   n° test
α (α-helix secondary structure)                                      1         13         6
                                                                     3          7         9
                                                                     4         12        20
                                                                     7          7         8
                                                                     9          9         9
                                                                    11          7         9
β (β-sheet secondary structure)                                     20         30        44
                                                                    23          9        12
                                                                    26         16        13
                                                                    30          7         6
                                                                    31          8         8
                                                                    32         13        19
                                                                    33          8         4
                                                                    35          9         4
                                                                    39          9         7
α / β (mixed or alternating α-helix and β-sheet segments)           46         29        48
                                                                    47         11        12
                                                                    48         11        13
                                                                    51         13        27
                                                                    54         10        12
                                                                    57          9         8
                                                                    59         10        14
                                                                    62         11         7
                                                                    69         11         4
α + β (α-helix and β-sheet segments not mixed)                      72          7         8
                                                                    87         13        27
                                                                   110         12        27

In order to perform protein folding recognition by means of machine learning
methods, proteins must be converted into vectors of numerical or categorical
attributes. In Ding and Dubchak (2001), six sets of attributes are extracted
from amino acid sequences. These are amino acid composition “C” (20
attributes), predicted secondary structure “S” (21 attributes), hydrophobicity
“H” (21 attributes), normalized van der Waals volume “V” (21 attributes),
polarity “P” (21 attributes), and polarizability “Z” (21 attributes). In order
to train the classification algorithms, these sets of attributes can be used
alone or can be mixed in different combinations. For instance, one might train
a classifier based on the combination of the attributes in the sets C, S, and
H. In the sequel, the notation “CSH” means the union of the attributes in the
sets C, S, and H.


The effectiveness of a classifier is generally evaluated in terms of the
overall prediction accuracy it is able to achieve on a test sample.
Specifically, if z denotes the number of correctly classified proteins and a
is the size of the test sample, the overall accuracy is computed as Q = z / a.
However, for protein folding classification, it is also worth evaluating the
accuracy for each fold, obtaining 27 values on our test sample, given by the
ratios Q_d = z_d / a_d for each fold d, where z_d and a_d denote,
respectively, the number of correctly classified proteins and the total number
of proteins for fold d.
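Both accuracy measures are simple ratios, as the following Python sketch shows. The function names and the six-protein example are invented for illustration and do not reproduce any of the reported results.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def overall_accuracy(true_folds: Sequence[Hashable],
                     predicted_folds: Sequence[Hashable]) -> float:
    """Q = z / a: correctly classified proteins over the test sample size."""
    correct = sum(t == p for t, p in zip(true_folds, predicted_folds))
    return correct / len(true_folds)


def per_fold_accuracy(true_folds: Sequence[Hashable],
                      predicted_folds: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Q_d = z_d / a_d for every fold d that occurs in the test sample."""
    totals = Counter(true_folds)
    correct = Counter(t for t, p in zip(true_folds, predicted_folds) if t == p)
    return {d: correct[d] / totals[d] for d in totals}


# Hypothetical test sample of six proteins from folds 1, 3, and 4.
truth = [1, 1, 3, 3, 3, 4]
guess = [1, 3, 3, 3, 1, 4]
Q = overall_accuracy(truth, guess)       # 4 of 6 proteins are correct
Q_by_fold = per_fold_accuracy(truth, guess)
```

Because folds differ widely in size, Q is dominated by the populated folds, which is why the per-fold values Q_d are reported as well.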

Two main classifiers were considered for the computational testing: the
one-against-all and the round-robin methods, denoted respectively as DSVM_OAA
and DSVM_RR and described in the previous section, each embedding discrete SVM
model (DSVM) as the base binary classifier. Each algorithm was trained using
different combinations of the sets of attributes, formed by C, CS, CSH, CSHP,
CSHPV, and CSHPVZ, where the latter will be denoted as ALL hereafter. In this
way, the combination of the six attribute groupings with the two methods
DSVM_OAA and DSVM_RR originated 12 distinct classifiers, indicated as direct
in the sequel.

Table 2. Prediction accuracy (%) on the test sample achieved by direct methods

              Attributes
Method        C      CS     CSH    CSHP   CSHPV  ALL
SVM_OAA *     43.5   31.5   45.2   -      -      -
SVM_UOAA *    49.4   36.2   51.1   -      -      -
DSVM_OAA      28.1   37.1   39.0   41.7   41.2   38.3
DSVM_RR       37.7   47.5   51.4   43.3   48.2   42.3

* Results from Ding and Dubchak (2001)

Table 3. Prediction accuracy (%) on the test sample achieved by alternative voting methods

              Voting predictions
Method        {C}   {C+CS}  {C+CS+CSH}  {C+CS+CSH+CSHP}  {C+CS+CSH+CSHP+CSHPV}  {ALL}
V-NN_OAA *    20.5  36.8    40.6        41.1             41.2                   41.8
V-SVM_OAA *   43.5  43.2    45.2        43.2             44.8                   44.9
V-SVM_RR *    44.9  52.1    56.0        56.5             55.5                   53.9
V-SVM_UOAA *  49.4  48.6    51.1        49.4             50.9                   49.6
V-DSVM_OAA    28.1  39.7    42.6        44.7             43.9                   42.3
V-DSVM_RR     37.7  50.4    57.4        58.2             57.1                   53.2

* Results from Ding and Dubchak (2001)

Table 4. Prediction accuracy (%) for each fold of the test sample using DSVM_OAA in the voting framework

        Voting: V-DSVM_OAA
Fold    {C}   {C+CS}  {C+CS+CSH}  {C+CS+CSH+CSHP}  {C+CS+CSH+CSHP+CSHPV}  {ALL}
1       83.3  83.3    83.3        83.3             83.3                   83.3
3       66.7  77.8    88.9        88.9             88.9                   88.9
4       20.0  20.0    20.0        25.0             20.0                   20.0
7       25.0  37.5    50.0        50.0             50.0                   50.0
9       100.0 100.0   100.0       100.0            100.0                  100.0
11      33.3  55.6    55.6        55.6             44.4                   44.4
20      25.0  54.5    68.2        79.5             79.5                   68.2
23      41.7  50.0    50.0        50.0             50.0                   33.3
26      15.4  15.4    15.4        15.4             15.4                   15.4
30      33.3  50.0    50.0        50.0             50.0                   50.0
31      50.0  62.5    62.5        62.5             62.5                   62.5
32      10.5  10.5    10.5        10.5             10.5                   10.5
33      25.0  25.0    25.0        25.0             25.0                   25.0
35      25.0  25.0    25.0        25.0             25.0                   25.0
39      28.6  28.6    28.6        28.6             28.6                   28.6
46      22.9  54.2    58.3        58.3             58.3                   58.3
47      8.3   8.3     8.3         8.3              8.3                    8.3
48      30.8  38.5    38.5        38.5             38.5                   38.5
51      7.4   11.1    11.1        18.5             18.5                   18.5
54      16.7  16.7    16.7        16.7             16.7                   16.7
57      12.5  12.5    12.5        12.5             12.5                   12.5
59      21.4  21.4    21.4        21.4             21.4                   21.4
62      14.3  28.6    28.6        28.6             28.6                   28.6
69      25.0  25.0    25.0        25.0             25.0                   25.0
72      12.5  12.5    12.5        12.5             12.5                   12.5
87      3.7   22.2    22.2        22.2             22.2                   22.2
110     77.8  85.2    88.9        88.9             85.2                   88.9
Acc.    28.1  39.7    42.6        44.7             43.9                   42.3

Table 5. Prediction accuracy (%) for each fold of the test sample using DSVM_RR in the voting framework

        Voting: V-DSVM_RR
Fold    {C}   {C+CS}  {C+CS+CSH}  {C+CS+CSH+CSHP}  {C+CS+CSH+CSHP+CSHPV}  {ALL}
1       83.3  83.3    83.3        83.3             83.3                   83.3
3       77.8  88.9    88.9        88.9             88.9                   88.9
4       35.0  35.0    35.0        35.0             40.0                   40.0
7       25.0  50.0    50.0        50.0             50.0                   37.5
9       44.4  66.7    66.7        66.7             66.7                   55.6
11      22.2  22.2    22.2        22.2             22.2                   11.1
20      54.5  68.2    86.4        90.9             88.6                   88.6
23      16.7  16.7    16.7        16.7             16.7                   8.3
26      30.8  53.8    53.8        53.8             53.8                   46.2
30      33.3  50.0    50.0        50.0             50.0                   33.3
31      50.0  50.0    50.0        50.0             37.5                   37.5
32      21.1  36.8    36.8        36.8             36.8                   36.8
33      25.0  50.0    50.0        50.0             25.0                   25.0
35      25.0  25.0    25.0        25.0             25.0                   25.0
39      28.6  42.9    42.9        42.9             42.9                   ...
46      64.6  72.9    89.6        91.7             91.7                   ...
47      66.7  50.0    58.3        58.3             58.3                   ...
48      23.1  46.2    46.2        46.2             46.2                   ...
51      18.5  40.7    40.7        40.7             40.7                   ...
54      33.3  33.3    33.3        33.3             33.3                   ...
57      0.0   25.0    25.0        25.0             12.5                   ...
59      21.4  42.9    42.9        42.9             35.7                   ...
62      14.3  42.9    42.9        42.9             42.9                   ...
69      0.0   25.0    25.0        25.0             25.0                   ...
72      12.5  12.5    12.5        12.5             12.5                   ...
87      18.5  33.3    48.1        48.1             48.1                   ...
110     48.1  70.4    92.6        92.6             92.6                   ...
Acc.    37.7  50.4    57.4        58.2             57.1                   53.2

Besides these models, 12 further ensemble classifiers were derived by applying
the voting scheme proposed in Ding and Dubchak (2001). To explain how these
ensemble methods were generated, consider, for example, the three models
generated by training DSVM_OAA using the datasets composed of CS, CSH, and
CSHP. The predictions obtained on the test sample by each of these three
direct classifiers were combined, deriving the predictions for a new
classifier based on a majority-voting scheme. If two or more fold classes
receive the same number of votes for a protein, the tie is resolved by
assigning the fold class corresponding to the largest score among the
alternative fold classes predicted for that protein. The same procedure was
repeated using the six combinations of direct classifiers {C}, {C+CS},
{C+CS+CSH}, {C+CS+CSH+CSHP}, {C+CS+CSH+CSHP+CSHPV}, and
{C+CS+CSH+CSHP+CSHPV+CSHPVZ} (the latter denoted by {ALL}) for each of the
methods DSVM_OAA and DSVM_RR. The ensemble classifiers, based on the voting
scheme, are denoted as V-DSVM_OAA and V-DSVM_RR.
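The majority-voting rule with its tie-break can be sketched in Python. The sketch assumes that each direct classifier reports its predicted fold together with the score of that prediction and that scores from different classifiers are directly comparable; the function name and the example values are illustrative only.

```python
from collections import Counter
from typing import Hashable, List, Tuple


def majority_vote(predictions: List[Tuple[Hashable, float]]) -> Hashable:
    """Combine (fold, score) predictions, one pair per direct classifier."""
    counts = Counter(fold for fold, _ in predictions)
    top = max(counts.values())
    tied = {fold for fold, n in counts.items() if n == top}
    # With a single most-voted fold this returns it; otherwise the tie is
    # resolved in favor of the tied fold predicted with the largest score.
    best_fold, _ = max((p for p in predictions if p[0] in tied),
                       key=lambda p: p[1])
    return best_fold


# Hypothetical predictions from the direct classifiers trained on CS, CSH, CSHP.
clear_winner = majority_vote([(20, 0.4), (46, 1.1), (20, 0.2)])  # two votes for 20
three_way_tie = majority_vote([(1, 0.3), (3, 0.9), (4, 0.1)])    # one vote each
```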

In order to choose the best set of parameters 
for each of the resulting 24 classifiers, 10-fold 
cross-validation was applied on the training 
dataset before evaluating the accuracy on the 
test sample. 

Tables 2 and 3 show the overall accuracy on the test sample for the 24
classifiers, as well as for other competing methods considered in Ding and
Dubchak (2001). In particular, Table 2 indicates, in rows 3-4, the performance
exhibited by the direct classifiers DSVM_OAA and DSVM_RR using different sets
of attributes during the training process. Rows 1-2 of Table 2 report the
accuracy values obtained in Ding and Dubchak (2001) for SVM, based on the
one-against-all (SVM_OAA) and the unique-one-against-all (SVM_UOAA) schemes
trained with the same attributes. Table 3 contains the results obtained from
distinct combinations of predictions generated by alternative direct
classifiers embodied into the voting mechanism. The first four rows in Table 3
report the accuracy values obtained in Ding and Dubchak (2001) for neural
networks (V-NN_OAA) and SVM (V-SVM_OAA) based on the one-against-all scheme,
for SVM framed within the round-robin framework (V-SVM_RR), and for SVM based
on a unique-one-against-all procedure (V-SVM_UOAA). These four methods were
derived from a majority-voting scheme that used a combination of the
predictions, as indicated in the columns. The last two rows refer, instead, to
the ensemble classifiers V-DSVM_OAA and V-DSVM_RR.

The results presented in Tables 2 and 3 suggest some empirical findings.
First, notice that discrete SVM achieves the highest accuracy, 51.4%, among
the direct methods considered in Table 2. When discrete SVM is adopted as the
base binary classifier, the round-robin scheme consistently dominates the
one-against-all framework. This can be intuitively explained by the fact that,
due to the large number of folds, the one-against-all scheme can hardly
distinguish between a single fold, often represented by a few proteins, and a
relatively large set of proteins, including heterogeneous folds. Moreover, the
voting mechanism notably improves the accuracy for all the classifiers,
playing an effective regularization role. Finally, the best overall accuracy
of 58.2% on the test sample is achieved by discrete SVM, embodied into the
round-robin scheme and subject to the voting mechanism applied to the
combination of predictions {C+CS+CSH+CSHP}. Notice that by using a larger
number of predictions, derived from the direct classifiers, the overall
accuracy decreases, since the two sets of attributes “V” and “Z” introduce
some disturbing noise.

Tables 4 and 5 provide detailed accuracy results, at the level of individual
folds, for the voting methods V-DSVM_OAA and V-DSVM_RR. As one might expect,
the accuracy is higher for the most populated folds.

Although the prediction accuracy achieved by the different methods might
appear low with respect to other classification tasks, one should bear in mind
that there are 27 fold classes, so that a random classifier would obtain an
average accuracy of 3.7%. Furthermore, the use of machine learning methods
leads to great benefits, since the recognition in vitro of the folding of a
new protein is a costly and complex activity.

ABSTRACT

Recently, clustering and classification methods have seen many applications in bioinformatics. Some
are straightforward applications of existing techniques, but most have been adapted to cope with
peculiar features of the biological data. Many biological data take the form of vectors, whose components
correspond to attributes characterizing the biological entities being studied. Comparing these vectors,
also known as profiles, is a crucial step for most clustering and classification methods. We review the recent
developments related to hierarchical profiling, where the attributes are not independent but rather are
correlated in a hierarchy. Hierarchical profiling arises in a wide range of bioinformatics problems,
including protein homology detection, protein family classification, and metabolic pathway clustering.
We discuss in detail several clustering and classification methods where hierarchical correlations are
tackled in effective and efficient ways, by incorporation of domain-specific knowledge. Relations to
other statistical learning methods and more potential applications are also discussed.


INTRODUCTION 

Profiling entities based on a set of attributes and then comparing these
entities by their profiles is a common, and often effective, paradigm in
machine learning. Given profiles, frequently represented as vectors of binary
or real numbers, the comparison amounts to measuring the “distance” between a
pair of profiles. Effective learning hinges on a proper and accurate measure
of distance.

In general, given a set A of N attributes, A = {a_i | i = 1, ..., N},
profiling an entity x on A gives a mapping p(x) → ℝ^N; namely, p(x) is an
N-vector of real values. For convenience, we also use x to denote its profile
p(x), and x_i to denote the i-th component of p(x). If all attributes in A can
only take the two discrete values 0 and 1, then p(x) → {0,1}^N yields a binary
profile. The distance between a pair of profiles x and y is a function
D(x, y) → ℝ. Hamming distance is a straightforward, and also one of the most
commonly used, distance measures for binary profiles; it is a simple summation
of the differences at each individual component:

$$D(x, y) = \sum_{i=1}^{N} d(i) \qquad (1)$$

where d(i) = |x_i - y_i|. For example, given x = (0, 1, 1, 1, 1) and
y = (1, 1, 1, 1, 1), then D(x, y) = Σ_{i=1}^{5} d(i) = 1+0+0+0+0 = 1. A
variant definition of d(i), which is also very commonly used, is that d(i) = 1
if x_i = y_i and d(i) = -1 otherwise. In this variant definition,
D(x, y) = Σ_{i=1}^{5} d(i) = -1+1+1+1+1 = 3.
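The two definitions can be checked on the example profiles with a few lines of Python; the function names are chosen here for illustration only.

```python
from typing import Sequence


def hamming_distance(x: Sequence[int], y: Sequence[int]) -> int:
    """Equation (1) with d(i) = |x_i - y_i|."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def agreement_score(x: Sequence[int], y: Sequence[int]) -> int:
    """Variant with d(i) = 1 if x_i = y_i and d(i) = -1 otherwise."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(1 if a == b else -1 for a, b in zip(x, y))


x = (0, 1, 1, 1, 1)
y = (1, 1, 1, 1, 1)
distance = hamming_distance(x, y)   # 1 + 0 + 0 + 0 + 0 = 1
score = agreement_score(x, y)       # -1 + 1 + 1 + 1 + 1 = 3
```

Note that the variant grows with similarity, whereas the first definition grows with dissimilarity, so the two rank pairs of profiles in opposite directions.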

The Euclidean distance, defined as D = √(Σ_{i=1}^{N} (x_i - y_i)^2), has a
geometric representation: a profile is mapped to a point in a vector space
where each coordinate corresponds to an attribute. Besides the Euclidean
metric, in vector space the distance between two profiles is also often
measured as the dot product of the two corresponding vectors:
x • y = Σ_{i=1}^{N} x_i y_i. The dot product is a key quantity used in Support
Vector Machines (Vapnik, 1997; Cristianini & Shawe-Taylor, 2000; Schölkopf &
Smola, 2002). Many clustering methods applicable to vectors in Euclidean space
can be applied here, such as K-means.
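Both quantities are straightforward to compute; the short Python sketch below uses function names chosen for illustration and reuses the binary example profiles from the Hamming distance discussion.

```python
import math
from typing import Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dot_product(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of the component-wise products x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))


x = (0, 1, 1, 1, 1)
y = (1, 1, 1, 1, 1)
d_xy = euclidean_distance(x, y)   # the profiles differ in one component
s_xy = dot_product(x, y)          # four components are 1 in both profiles
```

For binary profiles the squared Euclidean distance coincides with the Hamming distance of equation (1), since each differing component contributes exactly 1.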

While Hamming distance and Euclidean distance are the commonly adopted
measures of profile similarity, both rest on the underlying assumption that
the attributes are independent and contribute equally in describing the
profile. Therefore, the distance between two profiles is simply a sum of the
distances (i.e., differences) between them at each attribute. These measures
become inappropriate when the attributes do not contribute equally, or are not
independent but rather correlated with one another. As we will see, this is
often the case in real-world biological problems.

Intuitively, nontrivial relations among attributes complicate the comparison
of profiles. An easy and pragmatic remedy is to introduce scores or weighting
factors for individual attributes to adjust their apparently different
contributions to the Hamming or Euclidean “distance” between profiles. That
is, the value of d(i) in equation (1) now depends not only on the values of
x_i and y_i, but also on the index i. Often, scoring schemes of this type are
also used for situations where attributes are correlated, sometimes in a
highly nonlinear way. Different scoring schemes have thus been invented in
order to capture the relationships among attributes. Weighting factors in
these scoring schemes are either preset a priori based on domain knowledge
about the attributes, or fitted from the training examples, or determined by a
combination of both. Put into a mathematical framework, these scoring-based
approaches can be viewed as approximating the correlations among attributes,
which, without loss of generality, can be represented as a polynomial
function. In general, a formula that can capture correlations among attributes
as pairs, triples, quadruples, and so forth, may look like the following:

$$D' = \sum_{i}^{N} d(i) + \sum_{i \ne j}^{N} d(i)\,c(i,j)\,d(j) + \sum_{i \ne j \ne k}^{N} d(i)\,d(j)\,d(k)\,c(i,j,k) + \cdots \qquad (2)$$

where the coefficients c(i,j), c(i,j,k), ..., are used to represent the
correlations. This is much like introducing more neurons and more hidden
layers in an artificial neural network approach, or introducing nonlinear
kernel functions in kernel-based methods. Because the exact relations among
attributes are not known a priori, an open formula like equation (2) is
practically useless: as the number of these coefficients grows exponentially
with the profile size, solving it would be computationally intractable, and
there would not be enough training examples to fit these coefficients.
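The exponential growth can be made concrete with a short count. The Python sketch below assumes, as a simplification, one coefficient per unordered pair, triple, and so on; the function name and the `max_order` parameter are introduced here for illustration.

```python
from math import comb
from typing import Optional


def correlation_coefficients(n: int, max_order: Optional[int] = None) -> int:
    """Number of coefficients c(i,j), c(i,j,k), ... for n attributes,
    counting each unordered tuple of distinct attributes once."""
    top = n if max_order is None else min(max_order, n)
    return sum(comb(n, k) for k in range(2, top + 1))


pairs_only = correlation_coefficients(16, max_order=2)  # pairwise terms only
all_orders = correlation_coefficients(16)               # equals 2**16 - 16 - 1
```

Under this counting the total over all orders is 2^N - N - 1, so even a 16-component profile already requires tens of thousands of coefficients.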
However, it was found that the situation becomes tractable when these
correlations can be structured as a hierarchy, a quite loose requirement that
is readily met in many cases, as we shall see later. In general, a
hierarchical profile of entity x can be defined as p(x) → {0,1}^L, where L
stands for the set of leaves in a rooted tree T. A hierarchical profile is no
longer a plain string of zeros and ones. Instead, it may be best represented
as a tree with the leaves labeled by zeros and ones for binary profiles, or by
real numbers for real-valued profiles.

As the main part of this chapter, we will discuss in detail several clustering
and classification methods in which hierarchical profiles are handled
effectively and efficiently, notably through the incorporation of
domain-specific knowledge. Relations to other statistical learning methods,
such as kernel engineering, and further possible applications to other
bioinformatics problems are also discussed toward the end of this chapter.

HIERARCHICAL PROFILINGS IN 
BIOINFORMATICS 

Functional Annotations 

Hierarchical profiling arises naturally in many bioinformatics problems.
Probably the first example is the use of phylogenetic profiles of proteins to
assign functions to proteins (Pellegrini et al., 1999). To help understand the
key concepts and motivations, and also to introduce some terminology for later
discussions, we first briefly review the bioinformatics methodologies for
functional annotation.

Determining protein functions, also called functional annotation, has been and
remains a central task in bioinformatics. Over the past 25 years, many
computational methodologies have been developed toward solving this task. The
development of these computational approaches can be broadly divided into four
stages by both chronological order and algorithmic sophistication (Liao &
Noble, 2002). For our purpose here, we categorize these methods into three
levels according to the amount and type of information used.

The methods in level one compare a pair of proteins for sequence similarity.
Among them are the landmark dynamic programming algorithm by Smith and
Waterman (1980) and its heuristic variations BLAST (Altschul et al., 1990) and
FASTA (Pearson, 1990). The biological rationale behind these methods is
protein homology: two proteins are homologous if they share a common ancestor.
Homologous proteins are therefore similar to each other in primary sequence
and keep the same function as their ancestor until the evolutionary divergence
(mutations, deletions, or insertions relative to the ancestral protein, or
gene, to be more precise) is significant enough to cause a change. A typical
way to annotate a gene with unknown function is to search against a database
of genes whose functions are already known, such as GenBank
(http://www.ncbi.nlm.nih.gov/Genbank), and assign to the query gene the
function of a homologous gene found in the database.

The next level’s methods use multiple sequences from a family of proteins with
the same or similar functions, in order to collect aggregate statistics for
more accurate and reliable annotation. Profiles (Gribskov et al., 1987) and
hidden Markov models (Krogh et al., 1994; Durbin et al., 1998) are two popular
methods to capture and represent these aggregate statistics from whole
sequences for protein families. More refined and sophisticated methods using
aggregate statistics have since been developed, such as PSI-BLAST (Altschul et
al., 1997), SVM-Fisher (Jaakkola et al., 1999, 2000), Profile-Profile
(Sadreyev & Grishin, 2003; Mittelman et al., 2003), and SVM-Pairwise (Liao &
Noble, 2002, 2003). The aggregate statistics may also be represented as
patterns or motifs. Methods based on patterns and motifs include BLOCKS
(Henikoff & Henikoff, 1994), MEME (Bailey & Elkan, 1995), Meta-MEME (Grundy et
al., 1997), and eMotif (Nevill-Manning et al., 1998).

The third level’s methods go beyond sequence similarity to utilize information
such as DNA microarray gene expression data, phylogenetic profiles, and
genetic networks. Not only can these methods detect distant homologues
(homologous proteins with sequence identity below 30%), but they can also
identify proteins with related functions, such as those found in a metabolic
pathway or a structural complex.

Given a protein, its phylogenetic profile is represented as a vector, where
each component corresponds to a genome and takes a value of either one or
zero, indicating respectively the presence or absence of a homologous protein
in the corresponding genome. Protein phylogenetic profiles were used in
Pellegrini, Marcotte, Thompson, Eisenberg, and Yeates (1999) to assign protein
functions based on the hypothesis that functionally linked proteins, such as
those participating in a metabolic pathway or a structural complex, tend to be
preserved or eliminated together in a new species. In other words, these
functionally linked proteins tend to co-evolve. In Pellegrini et al. (1999),
16 then-fully-sequenced genomes were used in building the phylogenetic
profiles for 4290 proteins in the E. coli genome. The phylogenetic profiles,
expressed as 16-vectors, were clustered as follows: proteins with identical
profiles are grouped and considered to be functionally linked, and two groups
are called neighbors when their phylogenetic profiles differ by one bit. The
results based on these simple clustering rules supported the functional
linkage hypothesis. For instance, homologues of ribosomal protein RL7 were
found in 10 out of 11 eubacterial genomes and in yeast, but not in archaeal
genomes. They found that more than half of the E. coli proteins with the RL7
profile, or with profiles differing from that of RL7 by one bit, have
functions associated with the ribosome, although none of these proteins share
significant sequence similarity with the RL7 protein. That is, these proteins
are unlikely to be annotated as RL7 homologues by sequence similarity based
methods.
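The two clustering rules (identical profiles form a group; groups whose profiles differ by one bit are neighbors) can be sketched in Python. The protein names and the 4-bit profiles below are made up for illustration; the original study used 16-bit profiles.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Profile = Tuple[int, ...]


def group_by_profile(profiles: Dict[str, Sequence[int]]) -> Dict[Profile, List[str]]:
    """Group proteins that share exactly the same phylogenetic profile."""
    groups: Dict[Profile, List[str]] = defaultdict(list)
    for protein, profile in profiles.items():
        groups[tuple(profile)].append(protein)
    return dict(groups)


def neighbor_groups(groups: Dict[Profile, List[str]]) -> List[Tuple[Profile, Profile]]:
    """Pairs of groups whose profiles differ in exactly one bit."""
    return [(p, q) for p, q in combinations(groups, 2)
            if sum(a != b for a, b in zip(p, q)) == 1]


# Hypothetical presence/absence profiles over four genomes.
profiles = {
    "P1": (1, 1, 0, 1),
    "P2": (1, 1, 0, 1),
    "P3": (1, 0, 0, 1),
    "P4": (0, 0, 1, 0),
}
groups = group_by_profile(profiles)
neighbors = neighbor_groups(groups)
```

Here P1 and P2 form one group, and the group containing P3 is its only neighbor.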

There are some fundamental questions regarding the measure of profile
similarity that can affect the results from the analysis of phylogenetic
profiles. Can we generalize the definition of similar profiles? In other
words, can we devise a measure so that we can calculate similarity for any
pair of profiles? What threshold should be adopted when profile similarity is
used to infer functional linkage?

The first simple measure of profile similarity to be proposed was the Hamming
distance. In Marcotte, Xenarios, van Der Bliek, and Eisenberg (2000),
phylogenetic profiles are used to identify and predict subcellular locations
for proteins, in particular to distinguish mitochondrial from
non-mitochondrial proteins.

The first work that recognizes phylogenetic profiles as a kind of hierarchical
profiling is Liberles, Thorén, von Heijne, and Elofsson (2002), a method that
utilizes the historical evolution of two proteins to account for their
similarity (or dissimilarity). Evolutionary relationships among organisms can
be represented as a phylogenetic tree where leaves correspond to the current
organisms and internal nodes correspond to hypothetical ancient organisms. So,
rather than simply counting the presence and absence of the proteins in the
current genomes, a quantity called differential parsimony is calculated, which
minimizes the number of changes that have to be made at tree branches to
reconcile the two profiles.

Comparative Genomics 

Another example of hierarchical profiling arises 
from comparing genomes based on their metabolic 
pathway profiles. A main goal of comparing ge- 
nomes is to reveal the evolutionary relationships 
among organisms, which can be represented as 
phylogenetic trees. 

Originally, phylogenetic trees were constructed based on phenotypic
(particularly morphological) features of organisms. Nowadays, molecular
reconstruction of phylogenetic trees is most commonly based on comparisons of
small subunit ribosomal RNA (16S rRNA) sequences (Woese, 1987). Small subunit
rRNA is conventionally used as the gold standard for phylogeny studies, mainly
due to two factors: its ubiquitous presence and relative stability during
evolution. However, the significance of phylogenies based on these sequences
has recently been questioned, with growing evidence for extensive lateral
transfer of genetic material, a process that blurs the boundaries between
species. Phylogenetic trees based on individual protein or gene sequence
analysis (thus called gene trees) are not all congruent with species trees
based on 16S rRNA (Eisen, 2000). Attempts at building phylogenetic trees based
on different information levels, such as pathways (Dandekar et al., 1999;
Forst & Schulten, 2001), or on particular molecular features such as folds
(Lin & Gerstein, 2000), have also led to mixed results regarding congruence
with the 16S rRNA-based trees.

From a comparative genomics perspective,
it makes more sense to study evolutionary
relationships based on genome-wide information
rather than on a single piece of the genome, be it
an rRNA or a gene. As more and more genomes are
fully sequenced, such genome-wide information
becomes available. One particularly interesting
type of information is the entire repertoire of
metabolic pathways in an organism, as the cellular
functions of an organism are carried out via these
metabolic pathways. A metabolic pathway is a
chain of biochemical reactions that together fulfill
a certain cellular function. For example, glycolysis
is a pathway found in most cells; it consists
of 10 sequential reactions that convert glucose to
pyruvate while generating the energy that the cell
needs. Because most of these reactions require
enzymes as catalysts, in an enzyme-centric scheme
pathways are represented as sequences of component
enzymes. It is reasonable to take as a necessary
condition for a pathway to exist in an organism
that all the component enzymes of that pathway
are available. Enzymes are denoted by Enzyme
Commission (EC) numbers, which specify the
substrate specificity. Most enzymes are proteins.
Metabolic pathways in a completely sequenced
genome are reconstructed by identifying the enzyme
proteins that are required for a pathway
(Gaasterland & Selkov, 1995; Karp et al., 2002).
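
The necessary condition stated above (a pathway can be present only if all of its component enzymes are available) amounts to a subset test. The sketch below assumes enzymes are identified by EC-number strings; the function name and the small enzyme sets are illustrative, not taken from any database.

```python
def pathway_present(pathway_enzymes, genome_enzymes):
    """Return True only if every component enzyme is found in the genome."""
    return set(pathway_enzymes) <= set(genome_enzymes)


# Illustrative enzyme repertoire of one genome (EC numbers as plain strings).
genome = {"EC 1.1.1.1", "EC 2.7.1.1", "EC 5.3.1.9"}

complete = ["EC 2.7.1.1", "EC 5.3.1.9"]     # both enzymes are in the genome
incomplete = ["EC 2.7.1.1", "EC 4.1.2.13"]  # second enzyme is missing

print(pathway_present(complete, genome))
print(pathway_present(incomplete, genome))
```

Because the condition is necessary rather than sufficient, a True result only means the pathway is not ruled out by the enzyme inventory.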

The information about the presence and absence of
metabolic pathways in genomes can be represented
as a binary matrix, as shown in Figure 1, where
an entry (i, j) = 1 or 0 indicates whether pathway j
is present or absent in genome i. Therefore, each row
serves as a profile of the corresponding genome
(termed a metabolic pathway profile), and each
column serves as a profile of the corresponding
pathway (termed a phyletic profile). It is reasonable
to believe that valuable information about
evolution and co-evolution is embedded in these
profiles, and that comparison of profiles would
reveal, to some degree, the evolutionary relations
among the entities (either genomes or pathways)
represented by these profiles.
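
The two kinds of profile are simply the rows and columns of the same matrix. The following sketch uses a made-up matrix of three genomes and four pathways; the function names are assumptions chosen to mirror the terminology in the text.

```python
# Hypothetical presence/absence matrix: rows are genomes, columns are pathways.
matrix = [
    [1, 0, 1, 1],  # genome O_1
    [1, 1, 0, 1],  # genome O_2
    [0, 1, 0, 1],  # genome O_3
]


def metabolic_pathway_profile(matrix, i):
    """Row i: which pathways are present in genome i."""
    return list(matrix[i])


def phyletic_profile(matrix, j):
    """Column j: which genomes contain pathway j."""
    return [row[j] for row in matrix]


print(metabolic_pathway_profile(matrix, 0))
print(phyletic_profile(matrix, 1))
```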

Once again, the attributes used for building
these profiles are not independent but correlated
with each other. Different metabolic pathways may
be related to one another in terms of physiological
function; e.g., one pathway's absence/presence may
be correlated with another pathway's absence/presence
in a genome. These relationships among pathways, as
attributes of metabolic pathway profiles (MPPs),
should be taken into account when comparing genomes
based on their MPPs. The relationships among various
pathways are categorized as a hierarchy in the WIT
database (Overbeek et al., 2000), which can be found
at the following URL
(http://compbio.mcs.anl.gov/puma2/cgi-bin/functional_overview.cgi).

COMPARING HIERARCHICAL
PROFILES 

In the last section, we showed that the data and
information in many bioinformatics problems can be
represented as hierarchical profiles. Consequently,
the clustering and classification of such data and
information need to account for the hierarchical
correlations among attributes when measuring
profile similarity. While a generic formula
like equation (2) can in theory account for arbitrary
correlations, its demand for an exponentially
growing number of training examples and its lack of
an effective learning mechanism render the formula
practically useless. In hierarchical profiles,
however, the structure of relations among attributes
is known, and sometimes the biological interpretation
of these relations is also known. As a result, the
learning problems become much more tractable.

Figure 1. Binary matrix encoding the presence/absence
of pathways in genomes; O_1 to O_m represent
m genomes, and P_1 to P_n represent n pathways

P-Tree Approach 

Realizing that hierarchical profiles contain
information not only in the vector but also in
the hierarchical correlations, it is natural to first
attempt to treat them as trees. How to compare
trees is itself an interesting topic with wide
applications and has been the subject of numerous
studies. In Liao, Kim, and Tomb (2002), a
p-Tree based approach was proposed to measure
the similarity of metabolic pathway profiles. The
relationships among pathways are adopted from
the WIT database and represented as a Master
tree. About 3300 known metabolic pathways are
collected in the WIT database, and these pathways
are represented as leaves in the Master tree. Then,
for each genome, a p-Tree is derived from the
Master tree by marking off leaves whose
corresponding pathways are absent from the organism.
In this representation, a profile is no longer
a simple string of zeros and ones in which each bit
is treated equally and independently. Instead, it
is mapped onto a p-Tree so that the hierarchical
relationship among bits is restored and encoded
in the tree.

The comparison of two p-Trees thus evaluates
the difference between the two corresponding
profiles. To take the hierarchy into account, a
scoring scheme ought to weight (mis)matches at bits i
and j according to their positions in the tree, for
example whether i and j are siblings or are located
far apart in the tree. To this end, the (mis)match
scores are transmitted bottom-up to the root of the
Master tree in four steps: (1) overlay the two
p-Trees; (2) score mismatches and matches between
the two p-Trees and label the scores at the
corresponding leaves on the Master tree; (3) average
the scores from siblings (weighting by breadth) and
assign the average to the parent node; (4) iterate
step 3 until the root is reached. An algorithm
implementing these steps is straightforward and has
a time complexity linear in the size of the Master
tree.

To demonstrate how the algorithm works,
let us look at an example of two organisms, org_i
and org_j, and three pathways p1, p2, and p3. Two
hypothetical cases are considered and are
demonstrated in Figures 2 and 3, respectively. In
case one, org_i contains pathways p1 and p3, and
org_j contains p2 and p3. The metabolic pathway
profiles for org_i and org_j are shown in panel B,
and their corresponding p-Trees are displayed in
panel C. In panel D, the two p-Trees are superposed.
Matches and mismatches are scored at the
leaves, and the scores are propagated up to the
root. In case two, org_i contains pathways p1 and
p2, whereas org_j contains all three pathways, and
a similar calculation is shown in Figure 3. In this
example, given the topology of the Master tree,
the two cases have the same final score of 0 in the
p-Tree scoring scheme. This is in contrast to the
variant Hamming scoring scheme, where the score
for case 1 equals (-1-1+1) = -1 and the score for
case 2 equals (1+1-1) = 1. Evidently, the p-Tree
scoring scheme has taken into account the pathway
relationships present in the Master tree: p1 and
p2, being in the same branch, are likely to have a
similar biochemical or physiological role, which
is distinct from the role of the p3 pathway.

Figure 2. A comparison of two organisms org_i and org_j with respect to the three pathways p1, p2, and
p3 (weighted scoring scheme, Case 1). Panel A features the Master tree, where each pathway is
represented by a circle and each square represents an internal node; in Panel B, the presence (absence) of
a pathway is indicated by 1 (0); Panel C features the p-Trees, where a filled (empty) circle indicates the
presence (absence) of the pathway; in Panel D, the bottom-up propagation of the scores is illustrated; a
cross (triangle) indicates a mismatch (match)

Figure 3. A comparison of two organisms org_i and org_j with respect to the three pathways p1,
p2, and p3 (weighted scoring scheme, Case 2); the same legends as in Figure 2
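
The two cases can be reproduced with a short sketch of the bottom-up averaging scheme. Several things here are assumptions rather than details confirmed by the text: the Master tree is encoded as nested tuples with pathways p1 and p2 as siblings and p3 attached directly to the root, a match scores +1 and a mismatch -1 (as implied by the Hamming arithmetic above), and two absences are treated as a match.

```python
def ptree_score(node, profile_a, profile_b):
    """Score a Master-tree node: leaves get +1/-1, parents average their children."""
    if isinstance(node, int):  # leaf: index of a pathway in the profiles
        return 1.0 if profile_a[node] == profile_b[node] else -1.0
    child_scores = [ptree_score(child, profile_a, profile_b) for child in node]
    return sum(child_scores) / len(child_scores)


def hamming_variant(profile_a, profile_b):
    """Flat alternative: add +1 per match and -1 per mismatch, ignoring the tree."""
    return sum(1 if a == b else -1 for a, b in zip(profile_a, profile_b))


# Assumed topology: (p1, p2) share a parent; p3 is a child of the root.
MASTER_TREE = ((0, 1), 2)

# Profiles are ordered (p1, p2, p3).
case1 = ([1, 0, 1], [0, 1, 1])  # org_i has p1, p3; org_j has p2, p3
case2 = ([1, 1, 0], [1, 1, 1])  # org_i has p1, p2; org_j has all three

for org_i, org_j in (case1, case2):
    print(ptree_score(MASTER_TREE, org_i, org_j), hamming_variant(org_i, org_j))
```

Each node is visited once, so the running time is linear in the size of the tree, in line with the complexity stated for the algorithm.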

Using the pathway hierarchical categories in the
WIT database and the scoring scheme described
above, we compared 31 genomes based on their
metabolic pathway profiles (MPPs) (see Figure
4). Relations among the 31 genomes are represented
as an MPP-based tree and are compared with the
phylogeny based on 16S rRNA. While the MPP-based
tree is congruent with the 16S rRNA-based
tree at several levels, some interesting
discrepancies were found. For example, the extremely
radiation-resistant organism D. radiodurans is
positioned in the E. coli metabolic lineage. The noted
deviation from the classical 16S rRNA phylogeny
suggests that pathways have undergone evolution
transcending the boundaries of species and genera.
The MPP-based trees can be used to suggest
alternative production platforms for metabolic
engineering. A different approach to this problem on
the same dataset was later proposed in Heymans
and Singh (2003).

Parameterized Tree Distance 

Evolution of pathways and its interaction with the
evolution of the host genomes can be further
investigated by comparing the columns (phyletic
profiles) in the binary matrix in Figure 1. The
intuition is that the co-occurrence pattern of
pathways in a group of organisms would coincide
with the evolutionary path of these host organisms,
for instance, as a result of the emergence of
enzymes common to those pathways. Therefore,
such patterns would give useful information, e.g.,
about organism-specific adaptations. These
co-occurrences can be detected by clustering pathways
based on their phyletic profiles. Again, the
phyletic profiles here are hierarchical profiles, and
the distance measure between them should be
calculated using the scheme that we developed
above. In Zhang, Liao, Tomb, and Wang (2002),
2719 pathways selected from 31 genomes in the
WIT database were clustered into 69 groups of
pathways, such that the pathways in each group
co-occur in the organisms.

Figure 4. MPP-based tree and 16S rRNA-based tree for 31 genomes; the neighbor-joining program from
the Phylip package is used to generate these trees

Further insights were achieved by studying
the evolution of enzymes that are components of
co-occurring pathways. For completely sequenced
genomes, sequences of these enzymes are available
and can be used to build individual gene trees
(in contrast to species trees). Comparisons of the
gene trees of component enzymes in co-occurring
pathways would serve as a basis for investigating
how component enzymes evolve and whether
they evolve in accordance with the pathways. For
example, suppose two pathways p1 = (e1, e2, e3) and
p2 = (e4, e5) co-occur in organisms o1, o2, and o3.
For component enzyme e1 of pathway p1, a gene
tree T_e1 is built with e1's homologues in organisms
o1, o2, and o3, by using some standard tree
reconstruction method such as the neighbor-joining
algorithm (Felsenstein, 1989). This can be done
for the other component enzymes e2 to e5 as well.
If, as an ideal case, the gene trees T_ei for i = 1 to 5
are identical, we would then say that the pathway
co-occurrence is congruent with speciation. When
the gene trees for component enzymes in co-occurring
pathways are not identical, comparisons of gene
trees may reveal how pathways evolve differently,
possibly due to lateral gene transfer, duplication,
or loss. Recent work also shows that analysis
of metabolic pathways may explain the causes and
evolution of enzyme dispensability.


To properly address tree comparisons, several
algorithms were developed and tested (Zhang et
al., 2002). One similarity measure is based on
leaf overlap. Let T1 and T2 be two trees. Let S1
be the set of leaves of T1 and S2 be the set of
leaves of T2. The leaf-overlap-based distance is
defined as

$$D_x(T_1, T_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}, \qquad (3)$$

where |.| denotes the set cardinality. A more
elaborate metric, called parameterized distance,
an extension of the editing distance between
two unordered trees, was proposed to account
for the structural difference between ordered
and rooted trees. A parameter c is introduced to
balance the costs incurred in different cases such
as deleting, inserting, and matching subtrees. A
dynamic programming algorithm is invoked to
calculate the optimal distance. For example, in
Figure 5, the distance between D1 and P is 1.875
and the distance between D2 and P is 1.813 when
the parameter c is set to 0.5. On the other
hand, the editing distances between D1 and P and
between D2 and P are both 6, representing the
cost of deleting the six nodes not touched by the
dotted mapping lines in Figure 5. Noting that D1
differs from D2 topologically, this example shows
that the parameterized distance reflects the
structural difference between trees better than the
editing distance. In Zhang et al. (2002), the 523
component enzymes of these 2719 pathways were
clustered, based on parameterized tree distance with
c set to 0.6, into exactly the same 69 clusters
obtained from pathway co-occurrence. This suggests
that our hypothesis about using co-occurrence to infer
evolution is valid, at least approximately.
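
The leaf-overlap measure of equation (3) needs only the two leaf sets, as the following sketch shows. The function name and the organism labels are hypothetical. Note that, as written, the ratio grows with the number of shared leaves, so a value of 1 means identical leaf sets.

```python
def leaf_overlap(leaves_1, leaves_2):
    """Ratio of shared leaves to all leaves of two trees, as in equation (3)."""
    s1, s2 = set(leaves_1), set(leaves_2)
    union = s1 | s2
    if not union:
        raise ValueError("at least one tree must have leaves")
    return len(s1 & s2) / len(union)


# Hypothetical leaf sets of two gene trees (leaves are organisms).
tree_1_leaves = {"o1", "o2", "o3"}
tree_2_leaves = {"o2", "o3", "o4"}

print(leaf_overlap(tree_1_leaves, tree_2_leaves))  # 2 shared out of 4 in total
```

This measure ignores topology altogether, which is the gap the parameterized distance is meant to fill; the dynamic programming algorithm for that distance is not reproduced here.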

Tree Kernel Approach 

As shown before, the use of phylogenetic profiles
for proteins has led to methods that are more
sensitive in detecting remote homologues. In the
following we discuss classifiers that use support
vector machines to explore similarity among
phylogenetic profiles as hierarchical profiles.

Figure 5. Illustration of parameterized distances
between trees. Red dotted lines map tree P to trees
D1 and D2. The parameter c = 1 gives the editing
distances; the editing distance between D1 and P
is 6, because six deletions of nodes other than a,
b, and c in D1 will make D1 identical to tree P.
Although D1 and D2 are topologically different,
their editing distances to P are both equal to 6

As a powerful statistical learning method,
support vector machines (SVMs) have recently
been applied with remarkable success to
bioinformatics problems, including remote protein
homology detection, microarray gene expression
analysis, and protein secondary structure
prediction. SVMs have also been applied to problems
in other domains, such as face detection and text
categorization. SVMs possess many of the
characteristics a good learning method should have:
they are expressive; they require fewer training
examples; they have an efficient learning algorithm;
they have an elegant geometric interpretation; and,
above all, they generalize well. The power of SVMs
comes partly from the data representation, where an
entity, e.g., a protein, is represented by a set of
attributes instead of a single score. However, how
those attributes contribute to distinguishing a
true positive (filled dot in Figure 6) from a true
negative (empty circle) may be quite complex. In
other words, the boundary line between the two
classes, if depicted in a vector space, can be highly
nonlinear (dashed line in the left panel of Figure
6); nonetheless, it is the goal of a classifier to
find this boundary line. The SVM method
finds a nonlinear mapping that transforms the data
from the original space, called the input space, into
a higher-dimensional space, called the feature space,
where the data can be linearly separable (right
panel of Figure 6). The learning power of SVMs
comes mostly from the use of kernel functions,
which define how the dot product between two
points in the feature space can be calculated as a
function of their corresponding vectors in the input
space. As the dot product between vectors is the
only quantity needed to find a class boundary in
the feature space, kernel functions contain
sufficient information and, more importantly,
avoid an explicit mapping to the high-dimensional
feature space; high dimensionality often poses
difficult problems for learning, such as overfitting,
a phenomenon termed the curse of dimensionality. The
other mechanism adopted by SVMs is to pick
the boundary line that has the maximum margin
to both classes. A maximum-margin boundary
line has a low Vapnik-Chervonenkis dimension,
which ensures good generalization. Because of
the central role played by kernel functions, how to
engineer kernel functions to incorporate
domain-specific information for better performance
has been an active research area. It is worth noting
that, compared to other similar learning
methods such as artificial neural networks,
SVMs require fewer training examples, which
is a great advantage in many bioinformatics
applications.

Vert (2002) proposed a tree kernel to compare
not just the profiles themselves but also their
global patterns of inheritance reflected in the
phylogenetic tree. In other words, the tree kernel
takes into account the phylogenetic histories of
two genes (when in evolution they were transmitted
together or not) rather than just comparing
phylogenetic profiles organism by organism
at the leaves of the phylogenetic tree. A kernel
function is thus defined as:

$$K(x, y) = \sum_{i=1}^{D} \Phi_i(x)\,\Phi_i(y) \qquad (4)$$

where $\Phi_i(x)$ corresponds to inheritance pattern i
for profile x. An inheritance pattern of x gives an
explanation of x; that is, the presence (1) or
absence (0) of gene x in each current genome is the
result of a series of evolutionary events that
happened in the ancestral genomes. If we assume a
gene is possessed by the ancestral genome, then at
each branch of the phylogenetic tree, which
corresponds to a speciation, the gene may either be
retained or be lost. An inheritance pattern
corresponds to a series of assignments of retention
or loss at all branches such that the results at the
leaves match the profile of x. Because of the
stochastic nature of these evolutionary events, we
cannot be certain whether a gene is retained or lost.
Rather, the best we can know may be the probability
of either case. Let $\Phi_i(x) = P(x \mid i)$, the
probability that profile x can be interpreted by
inheritance pattern i; then
$K(x, y) = \sum_{i=1}^{D} P(x \mid i)\,P(y \mid i)$
sums, over all possible patterns i, the probability
that both profiles x and y result from pattern i.
Intuitively, not all possible inheritance patterns
occur with the same frequency. Let P(i) be the
probability that pattern i actually occurs during
evolution; then the so-called tree kernel is refined
as:

$$K(x, y) = \sum_{i=1}^{D} P(i)\,P(x \mid i)\,P(y \mid i) \qquad (5)$$

The formalism of using joint probabilities as
kernels first appeared in other applications, such
as convolution kernels (Watkins, 1999; Haussler,
1999). Because the number of patterns D grows
exponentially with the size of the phylogenetic
tree, an efficient algorithm is needed to compute
the kernel. Such an algorithm was developed in
Vert (2002), which uses post-order traversals of
the tree and has a time complexity linear with
respect to the tree size.

Figure 6. Schematic illustration of the nonlinear mapping of data from input space to feature space for an
SVM

To test the validity of the tree kernel method,
phylogenetic profiles were generated for 2465
yeast genes, whose accurate functional classifications
are already known, by BLAST search against
24 fully sequenced genomes. For each gene, if a
BLAST hit with an E-value less than 1.0 is found in
a genome, the corresponding bit for that genome
is set to 1; otherwise it is set to 0.
The resulting profile is a 24-bit string of zeros
and ones. Two assumptions were made in calculating
the tree kernel for a pair of genes x and y.
Although the exact probabilities may never be
known, it is reasonable to assume that losing a
gene or obtaining a new gene is relatively rare
compared to keeping the status quo. In Vert (2002),
the probability that an existing gene is retained at a
tree branch (i.e., speciation) is set at 0.9, and the
probability that a new gene is created at a branch
is set at 0.1. It was further assumed that such a
distribution remains the same at all branches for
all genes. Even with these crude assumptions, in
the cross-validation experiments on those 2465
yeast genes, the tree kernel's classification accuracy
already significantly exceeds that of a kernel
using just the dot product $x \cdot y = \sum_{i=1}^{24} x_i y_i$.
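
To make equation (5) concrete, the sketch below evaluates it on a toy four-leaf tree in two ways: by enumerating every pattern, and by a single post-order pass. It is a simplified model, not the algorithm of Vert (2002): a pattern is taken to be an assignment of presence or absence to the internal nodes, the root prior of 0.5 is an assumption, and the tree and profiles are hypothetical. Only the branch probabilities 0.9 and 0.1 come from the text.

```python
from itertools import product

# Branch transition probabilities T[parent_state][child_state].
T = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.1, 0: 0.9}}
ROOT_PRIOR = {0: 0.5, 1: 0.5}  # assumption: uninformative prior at the root

# Hypothetical tree: internal nodes are tuples, leaves are profile indices.
TREE = ((0, 1), (2, 3))


def kernel_postorder(tree, x, y):
    """K(x, y) from equation (5) in one post-order pass over the tree."""
    def below(node):
        out = {0: 1.0, 1: 1.0}  # value of the subtree given the node's state
        for child in node:
            if isinstance(child, int):  # leaf: both profiles are observed here
                for s in (0, 1):
                    out[s] *= T[s][x[child]] * T[s][y[child]]
            else:  # internal child: sum over its unobserved state
                sub = below(child)
                for s in (0, 1):
                    out[s] *= sum(T[s][t] * sub[t] for t in (0, 1))
        return out

    root = below(tree)
    return sum(ROOT_PRIOR[s] * root[s] for s in (0, 1))


def kernel_bruteforce(tree, x, y):
    """Same sum, enumerating every assignment of states to internal nodes."""
    nodes = []  # (node, index of parent internal node or None)

    def collect(node, parent):
        index = len(nodes)
        nodes.append((node, parent))
        for child in node:
            if not isinstance(child, int):
                collect(child, index)

    collect(tree, None)
    total = 0.0
    for states in product((0, 1), repeat=len(nodes)):
        p_i = p_x = p_y = 1.0
        for index, (node, parent) in enumerate(nodes):
            s = states[index]
            p_i *= ROOT_PRIOR[s] if parent is None else T[states[parent]][s]
            for child in node:
                if isinstance(child, int):
                    p_x *= T[s][x[child]]
                    p_y *= T[s][y[child]]
        total += p_i * p_x * p_y
    return total


x, y = [1, 1, 0, 0], [1, 0, 0, 0]
print(kernel_postorder(TREE, x, y), kernel_bruteforce(TREE, x, y))
```

The enumeration doubles in cost with every internal node, whereas the post-order pass touches each node once, which illustrates why a traversal-based algorithm is needed for trees of realistic size.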

Extended Phylogenetic Profiles 

The tree-kernel approach's improvement in
classification accuracy is mainly due to engineering
the kernel functions. In Narra and Liao (2004,
2005), further improvement is attained through both
data representation and kernel engineering. As
the reader should be convinced by now,
these phylogenetic profiles contain more information
than just the string of zeros and ones; the
phylogenetic tree provides relationships among
the bits in these profiles. In Narra and Liao, a
two-step procedure is adopted to extend phylogenetic
profiles with extra bits encoding the tree structure:
(1) a score is assigned to each internal tree node;
(2) the score-labeled tree is then flattened into an
extended vector. For an internal node in a
phylogenetic tree, which is interpreted as the ancestor
of the nodes underneath it, one way to assign a
score is to take the average of the scores
of its child nodes. This scoring scheme
works recursively from the top down until the leaves
are reached: the score at a leaf is just the value of
the corresponding component in the hierarchical
profile. The same scoring scheme was also
used in the p-Tree approach. Unlike the p-Tree
approach, which keeps just the score at the root node
and thus inevitably causes information loss, here the
scores at all internal nodes are retained and then
mapped into a vector via a post-order tree traversal.
This vector is then concatenated to the original
profile vector to form an extended vector, which is
called a tree-encoded profile. The scheme works for
both binary and real-valued profiles. In order to
retain information, real-valued profiles for yeast
genes are used; the binary profiles for the tree
kernel are derived from the real-valued profiles by
imposing a cutoff on E-values.
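
The two-step encoding can be sketched as follows, assuming the phylogenetic tree is given as nested tuples whose leaves are indices into the profile. The tree shape, the example profile, and the function name are hypothetical.

```python
def tree_encode(tree, profile):
    """Append the post-order internal-node scores to the original profile."""
    internal_scores = []

    def score(node):
        if isinstance(node, int):  # leaf: value taken directly from the profile
            return float(profile[node])
        child_scores = [score(child) for child in node]
        value = sum(child_scores) / len(child_scores)
        internal_scores.append(value)  # recorded after the children: post-order
        return value

    score(tree)
    return list(profile) + internal_scores


# Hypothetical four-genome tree: genomes 0 and 1 are sisters, as are 2 and 3.
TREE = ((0, 1), (2, 3))

print(tree_encode(TREE, [1, 0, 1, 1]))
```

For this input the three appended values are the scores of the two inner nodes followed by the root, so the encoded profile has seven components instead of four.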

Given a pair of tree-encoded profiles x and y,
the polynomial kernel is used for classification:

$$K(x, y) = [1 + s\,D(x, y)]^{d} \qquad (6)$$

where s and d are two adjustable parameters.
Unlike ordinary polynomial kernels, D(x, y) is
not the dot product of vectors x and y, but rather
a generalized Hamming distance for real-valued
vectors:

$$D(x, y) = \sum_{i=1}^{n} S(|x_i - y_i|) \qquad (7)$$

where the ad hoc function S has value 7 for a
match, 5 for a mismatch by a difference less than
0.1, 3 for a mismatch by a difference less than 0.3,
and 1 for a mismatch by a difference less than 0.5.
The values in the ad hoc function S are assigned
based on the E-value distribution of the protein
dataset. The methods are tested on the same data
set by using the same cross-validation protocols
as in Vert (2002). The classification accuracy
obtained with the extended phylogenetic profiles,
E-values, and the polynomial kernel exceeds that of
the tree-kernel approach for most of the 133
functional classes of the 2465 yeast genes in Vert
(2002).
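
Equations (6) and (7) translate directly into code. In the sketch below, the score of 0 for differences of 0.5 or more, the reading of "match" as a difference of exactly zero, and the default values of s and d are assumptions; the text does not specify them. The example vectors are made up.

```python
def S(diff):
    """Ad hoc per-position score from the text; 0 beyond 0.5 is an assumption."""
    if diff == 0:
        return 7
    if diff < 0.1:
        return 5
    if diff < 0.3:
        return 3
    if diff < 0.5:
        return 1
    return 0


def generalized_hamming(x, y):
    """D(x, y) of equation (7): sum of S over the position-wise differences."""
    return sum(S(abs(a - b)) for a, b in zip(x, y))


def poly_kernel(x, y, s=1.0, d=2):
    """K(x, y) of equation (6); s and d are the adjustable parameters."""
    return (1 + s * generalized_hamming(x, y)) ** d


# Hypothetical real-valued profiles with four components.
x = [0.0, 0.5, 1.0, 0.25]
y = [0.0, 0.75, 0.0, 0.25]

print(generalized_hamming(x, y), poly_kernel(x, y))
```

Although D is called a distance, it is larger for more similar vectors, which is what allows it to take the place of the dot product inside the polynomial kernel.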

MORE APPLICATIONS AND 
FUTURE TRENDS 

We have seen in the last two sections some
problems in bioinformatics and computational
biology where relationships can be categorized
as a hierarchy, and how such hierarchical structure
can be utilized to facilitate learning. Because
of the central role played by evolutionary theory in
biology, and the fact that phylogeny is naturally
expressed as a hierarchy, it is no surprise that
hierarchical profiling arises in many biological
problems.

In Siepel and Haussler (2004), methods are 
developed that combine phylogenetic and hidden 
Markov models for biosequence analysis. Hidden 
Markov models, first studied as a tool for speech 
recognition, were introduced to the bioinformatics
field around 1994 (Krogh et al., 1994). Since
then, hidden Markov models have been applied 
to problems in bioinformatics and computational 
biology including gene identification, protein ho- 
mology detection, secondary structure prediction, 
and many more. 

In sequence modeling, hidden Markov models 
essentially simulate processes along the length of 
the sequence, mostly ignoring the evolutionary 
process at each position. On the other hand, in 
phylogenetic analysis, the focus is on the variation 
across sequences at each position, mostly ignoring 
correlations from position to position along their 
length. From a hierarchical profiling viewpoint, 
each column in a multiple sequence alignment 
is a profile, not binary but 20-ary, and provides a
sampling for the current sequences. It is a very 
attractive idea to combine these two apparently 
orthogonal models. In Siepel and Haussler, a 
simple and efficient method was developed to 
build higher-order states in the HMM, which al- 
lows for context-sensitive models of substitution, 
leading to significant improvements in the fit of 
a combined phylogenetic and hidden Markov 
model. Their work promises to be a very useful 
tool for some important biosequence analysis 
applications, such as gene finding and secondary 
structure prediction. 

Hierarchical relations also exist in applications
where phylogeny is not the subject. For example,
in Holme, Huss, and Jeong (2003), biochemical
networks are decomposed into sub-networks.
Because of the inherent non-local features possessed
by these networks, a hierarchical analysis
was proposed to take into account the global
structure while doing the decomposition. In another
work (Gagneur et al., 2003), the traditional way of
representing metabolic networks as a collection
of connected pathways is questioned: such a
representation suffers from the lack of a rigorous
definition, yielding pathways of disparate content
and size. Instead, they proposed a hierarchical
representation that emphasizes the gross organization
of metabolic networks into largely independent
pathways and sub-systems at several levels of
independence.

While hierarchical relations are widespread,
there are many applications where the topology
of the data relations cannot be described as a
hierarchy but rather as a network. Large-scale
gene expression data are often analyzed by
clustering genes based on gene expression data alone,
even though a priori knowledge in the form of
biological networks is available. In Hanisch, Zien,
Zimmer, and Lengauer (2002), a co-clustering method
is developed that makes use of this additional
information and has demonstrated considerable
improvements for exploratory analysis.

CONCLUSION 

As we have seen, hierarchical profiling can be
applied to many problems in bioinformatics
and computational biology, from remote protein
homology detection, to genome comparisons, to
metabolic pathway clustering, wherever the
relations among attribute data are structured as
a hierarchy. These hierarchical relations may be
inherent or may serve as an approximation to more
complex relationships. Dealing properly with such
relationships is essential for clustering and
classification of biological data. It is advisable to
be wary of relations whose hierarchical structure is
artificial; hierarchical profiling on these relations
may yield misleading results.

In many cases, the hierarchy, and the biological
insight embodied in it, can be integrated into
the framework of data mining. This consequently
facilitates learning and allows meaningful
interpretation of the learning results. We have
reviewed the recent developments in this respect.
Some methods treat hierarchical profile scoring as
a tree comparison problem, some as an encoding
problem, and some as a graphical model with a
Bayesian interpretation. The latter approach is of
particular interest, since most biological data are
stochastic by nature.

A trend is seen in bioinformatics of combining
different methods and models so that the hybrid
method can achieve better performance. In the
tree-kernel method, hierarchical profiling and
scoring are incorporated as a kernel engineering
task for support vector machines. In sequence
analysis, phylogenetic techniques and hidden
Markov models are combined to account for the
relationships exhibited in sequences that neither
method alone can handle properly. As more
and more biological data with complex relationships
are generated, it is reasonable to believe that
hierarchical profiling will see more applications,
serving either by itself as a useful tool for
analyzing these data or as a prototype for
developing more sophisticated and powerful data
mining tools.

REFERENCES 

Altschul, S. F., Gish, W., Miller, W., Myers, E., 
& Lipman, D. J. (1990). Basic local alignment 
search tool. Journal of Molecular Biology, 215, 
403-410. 

Altschul, S. F., Madden, T. L., Schaffer, A. A., 
Zhang, J., Zhang, Z., Miller, W., et al. (1997). 
Gapped BLAST and PSI-BLAST: A new genera- 
tion of protein database search programs. Nucleic 
Acids Res., 25, 3389-3402. 


Bailey, T. L., & Elkan, C. P. (1995). Unsupervised 
learning of multiple motifs in biopolymers using 
EM. Machine Learning, 21(1-2), 51-80. 

Burges, C. J. C. (1998). A tutorial on support vector 
machines for pattern recognition. Data Mining 
and Knowledge Discovery, 2, 121-167. 

Cristianini, N., & Shawe-Taylor, J. (2000). An in- 
troduction to support vector machines and other 
kernel-based learning methods. Cambridge, UK:
Cambridge University Press. 

Dandekar, T., Schuster, S., Snel, B., Huynen, M., 
& Bork, P. (1999). Pathway alignment: application 
to the comparative analysis of glycolytic enzymes. 
Biochemical Journal, 343, 115-124. 

Durbin, R., Eddy, S., Krogh, A., & Mitchison, 
G. (1998). Biological sequence analysis: Proba- 
bilistic models of proteins and nucleic acids. 
Cambridge, UK: Cambridge University Press. 

Eisen, J. A. (2000). Horizontal gene transfer among 
microbial genomes: New insights from complete 
genome analysis. Current Opinion in Genetics & 
Development, 10, 606-611. 

Felsenstein, J. (1989). PHYLIP — Phylogeny 
Inference Package (Version 3.2). Cladistics, 5, 
164-166. 

Forst, C. V., & Schulten, K. (2001). Phylogenetic 
analysis of metabolic pathways. Journal of Mo- 
lecular Evolution, 52, 471-489. 

Gaasterland, T., & Selkov, E. (1995). Reconstruc- 
tion of metabolic networks using incomplete 
information. In Proceedings of the Third Inter- 
national Conference on Intelligent Systems for 
Molecular Biology (pp. 127-135). Menlo Park, 
CA: AAAI Press. 

Gagneur, J., Jackson, D., & Casari, G. (2003).
Hierarchical analysis of dependence in metabolic 
networks. Bioinformatics, 19, 1027-1034. 


143 


Hierarchical Profiling, Scoring, and Applications in Bioinformatics 


Gribskov, M., McLachlan, A. D., & Eisenberg, 
D. (1987). Profile analysis: Detection of distantly 
related proteins. Proc. Natl. Acad. Sci. USA, 84 
(pp. 4355-4358). 

Grundy, W. N., Bailey, T. L., Elkan, C. P., & 
Baker, M. E. (1997). Meta-MEME: Motif-based 
hidden Markov Models of biological sequences. 
Computer Applications in the Biosciences, 13(4), 
397-406. 

Hanisch, D., Zien, A., Zimmer, R., & Lengauer, 
T. (2002). Co-clustering of biological networks 
and gene expression data. Bioinformatics, 18, 
S145-S154. 

Haussler, D. (1999). Convolution kernels on dis- 
crete structures (Technical Report UCSC-CRL- 
99-10). Santa Cruz: University of California. 

Henikoff, S., & Henikoff, J.G. (1994). Protein 
family classification based on search a database 
of blocks. Genomics, 19, 97-107. 

Heymans, M., & Singh, A. J. (2003). Deriving 
phylogenetic trees from the similarity analysis 
of metabolic pathways. Bioinformatics, 19, il38- 
il46. 

Holme, R, Huss, M., & Jeong, H. (2003). Sub- 
network hierarchies of biochemical pathways. 
Bioinformatics, 19, 532-538. 

Jaakkola, T., Diekhans, M., & Haussler, D. (1999). 
Using the Fisher Kernel Method to detect remote 
protein homologies. In Proceedings of the Seventh 
International Conference on Intelligent Systems 
for Molecular Biology (pp. 149-158). Menlo Park, 
CA: AAAI Press. 

Jaakkola, T., Diekhans, M., & Haussler, D. (2000). 
A discriminative framework for detecting remote 
protein homologies. Journal of Computational 
Biology, 7, 95-114. 

Karp, P.D., Riley, M., Saier, M., Paulsen, I. T., 
Paley, S. M., & Pellegrini-Toole, A. (2000). The 


EcoCyc and MetaCyc databases. Nucleic Acids 
Research, 28, 56-59. 

Krogh, A., Brown, M., Mian, I. S., Sjolander, K., 
& Haussler, D. (1994). Hidden Markov models in 
computational biology: Applications to protein 
modeling. Journal of Molecular Biology, 235, 
1501-1531. 

Liao, L., Kim, S., & Tomb, J-F. (2002). Genome 
comparisons based on profiles of metabolic 
pathways”, In The Proceedings of The Sixth 
International Conference on Knowledge-Based 
Intelligent Information & Engineering Systems 
(pp. 469-476). Crema, Italy: IOS Press. 

Liao, L., & Noble, W. S. (2003). Combining 
pairwise sequence similarity and support vector 
machines for detecting remote protein evolu- 
tionary and structural relationships. Journal of 
Computational Biology, 10, 857-868. 

Liberies, D. A., Thoren, A., von Heijne, G., & 
Elofsson, A. (2002). The use of phylogenetic 
profiles for gene predictions. Current Genomics, 
3, 131-137. 

Lin, J., & Gerstein, M. (2002). Whole-genome 
trees based on the occurrence of folds and ortho- 
logs: Implications for comparing genomes on dif- 
ferent levels. Genome Research, 10, 808-818. 

Marcotte, E. M., Xenarios, I., van Der Bliek, A. 
M., & Eisenberg, D. (2000). Localizing proteins 
in the cell from their phylogenetic profiles. Proc. 
Natl. Acad. Sci. USA, 97, (pp. 12115-12120). 

Mittelman, D., Sadreyev, R., & Grishin, N. (2003). 
Probabilistic scoring measures for profile-profile 
comparison yield more accurate shore seed align- 
ments. Bioinformatics, 19, 1531-1539. 

Narra, K., & Liao, L. (2004). Using extended phy- 
logenetic profiles and support vectormachines for 
protein family classification. In The Proceedings 
of the Fifth International Conference on Software 
Engineering, Artificial Intelligence, Networking, 


144 


Hierarchical Profiling, Scoring, and Applications in Bioinformatics 


and Parallel/Distributed Computing (pp. 152- 
157). Beijing, China: ACIS Publication. 

Narra, K., & Liao, L. (2005). Use of extended 
phylogenetic profiles with E-values and support 
vector machines for protein family classification. 
International Journal of Computer and Informa- 
tion Science, 6(1). 

Nevill-Manning, C. G., Wu, T. D., & Brutlag, D. 

L. (1998). Highly specific protein sequence motifs 
for genome analysis. Proc. Natl. Acad. Sci. USA, 
95(11), 5865-5871. 

Noble, W. (2004). Support vector machine applica- 
tions in computational biology. In B. Scholkopf, 
K. Tsuda, & J-P. Vert. (Eds.), Kernel methods in 
computational biology (pp. 71-92). Cambridge, 
MA: The MIT Press. 

Overbeek, R., Larsen, N., Pusch, G. D., D’Souza, 

M. , Selkov Jr., E., Kyrpides, N., Fonstein, M., 
Maltsev, N., & Selkov, E. (2000). WIT: Integrated 
system for high throughput genome sequence 
analysis and metabolic reconstruction. Nucleic 
Acids Res., 28, 123-125. 

Pearson, W. (1990). Rapid and sensitive sequence 
comparison with FASTP and FASTA. Meth. En- 
zymol., 183, 63-98. 

Pellegrini, M., Marcotte, E. M., Thompson, M. 
J., Eisenberg, D., & Yeates, T. O. (1999). Assign- 
ing protein functions by comparative genome 
analysis: Protein phylogenetic profiles. Proc. Natl. 
Acad. Sci. USA, 96, (pp. 4285-4288). 

Rabiner, L. R. (1989). A tutorial on hidden Mar- 
kov models and selected applications in speech 
recognition. Proc. IEEE, 77, 257-286. 

Sadreyev, R., & Grishin, N. (2003). Compass: A 
tool for comparison of multiple protein alignments 


with assessment of statistical significance. Journal 
of Molecular Biology, 326, 317-336. 

Scholkopf, B., & Smola, A. J. (2001). Learning 
with kernels: Support vector machines, learning). 
Cambridge, MA: The MIT Press. 

Siepel, A., & Elaussler, D. (2004). Combining 
phylogenetic and hidden Markov Models in 
biosequence analysis. J. Comput. Biol., 11(2-3), 
413-428. 

Smith, T. F., & Waterman, M. S.(1981). Identifica- 
tion of common molecular subsequences. Journal 
of Molecular Biology, 147, 195-197. 

Vapnik, V. (1998). Statistical Learning Theory. 
Adaptive and learning systems for signal process- 
ing, communications, and control. New York: 
Wiley. 

Vert, J-P. (2002). A tree kernel to analyze phylo- 
genetic profiles. Bioinformatics, 18, S276-S284. 

Watkins, C. (1999). Dynamic alignment kernels. 
In A. J. Smola, P. Bartlett, B. SchAolkopf, & C. 
Schuurmans (Ed.), Advances in large margin 
classifiers. Cambridge, MA: The MIT Press. 

Woese, C. (1987). Bacterial evolution. Microbial 
Rev., 51, 221-271. 

Zhang, K., Wang, J. T. L., & Shasha, D. (1996). On 
the editing distance between undirected acyclic 
graphs. International Journal of Foundations of 
Computer Science, 7, 43-58. 

Zhang, S., Liao, L., Tomb, J-F., Wang, J. T. L. 
(2002). Clustering and classifying enzymes in 
metabolic pathways: Some preliminary results. 
In ACM SIGKDD Workshop on Data Mining in 
Bioinformatics, Edmonton, Canada (pp. 19-24). 


This work was previously published in Advanced Data Mining Technologies in Bioinformatics, edited by H. Hsu, pp. 13-31, 
copyright 2006 by Idea Group Publishing (an imprint of IGI Global). 




Chapter IX 

Hierarchical Clustering Using 
Evolutionary Algorithms 


Monica Chiş

Avram Iancu University, Romania 


ABSTRACT 

Clustering is an important technique for discovering inherent structure in data. The purpose of cluster
analysis is to partition a given data set into a number of groups such that objects in a particular cluster
are more similar to each other than to objects in different clusters. Hierarchical clustering refers to the
formation of a recursive clustering of the data points: a partition into many clusters, each of which is
itself hierarchically clustered. Hierarchical structures arise in problems from a wide range of areas. In
this chapter, a new evolutionary algorithm for detecting the hierarchical structure of an input data set
is proposed. Such an algorithm could be very useful in economics, market segmentation, management,
biological taxonomy, and other domains. A new linear representation of the cluster structure within the
data set is proposed. An evolutionary algorithm evolves a population of clustering hierarchies. The
proposed algorithm uses mutation and crossover as (search) variation operators. The final goal is to
present a data clustering representation with which a hierarchical clustering structure can be found quickly.


INTRODUCTION 

Clustering is an important technique used in the
simplification of data sets or in discovering some
inherent structure present in data. The purpose of
cluster analysis is to partition a given data set into
a number of groups such that objects in a particular
cluster are more similar to each other than to objects
in different clusters (Jain & Dubes, 1998).

Clustering in data mining is a discovery process
that groups a set of data such that the intracluster
similarity is maximized and the intercluster
similarity is minimized. These discovered clusters
can be used to explain the characteristics of the
underlying data distribution and thus serve as
the foundation for other data-mining and analysis
techniques.

Clustering is an important problem, with 
applications in areas such as data mining and 
knowledge discovery, data compression and 
vector quantization, and pattern recognition and 
pattern classification. 

One-level or “flat” clustering gives no information
about the relationships existing between clusters.
Hierarchical clustering groups data into a tree
structure, thus building a multilevel representation
capable of revealing intercluster relationships.

Hierarchical clustering refers to the forma- 
tion of a recursive clustering of the data points: 
a partition into many clusters, each of which is 
itself hierarchically clustered. This method is very 
useful for clustering purposes. 

Hierarchical clustering constructs trees of 
clusters of objects in which any two clusters are 
disjoint, or one includes the other. The cluster of 
all objects is the root of the tree (Jain & Dubes, 
1998; Johnson, 1967). 

Agglomerative algorithms require a definition
of dissimilarity between clusters. The most common
ones are maximum or complete linkage, in
which the dissimilarity between two clusters is the
maximum of the dissimilarities between all
pairs of points in the different clusters; minimum
or single linkage or nearest neighbor, in which the
dissimilarity between two clusters is the minimum
over all those pairwise dissimilarities; and average
linkage, in which the dissimilarity between the
two clusters is the average, or suitably weighted
average, over all those pairwise dissimilarities.

Agglomerative algorithms begin with an initial
set of singleton clusters consisting of all the
objects. They proceed by agglomerating the pair
of clusters of minimum dissimilarity to obtain a
new cluster, removing the two clusters combined
from further consideration. This agglomeration
step repeats until a single cluster containing all
the observations is obtained. The set of clusters
obtained along the way forms a hierarchical
clustering (Jain & Dubes, 1998).

The problem of hierarchical clustering as- 
sumes a significant role in a variety of research 
areas such as data mining, pattern recognition, 
economics, biology, autonomous mobile robots, 
and so forth. 

Standard methods for detecting hierarchical 
cluster structure of a set of objects are either di- 
visive or agglomerative (Dumitrescu, 1999). 

In this chapter, an evolutionary algorithm
for detecting the hierarchical structure of a data
set is proposed. A new representation of the cluster
hierarchy is proposed: a linear sequence that
describes a binary tree. An evolutionary
algorithm with mutation and crossover as variation
operators is used. Binary tournament selection
is considered.

The rest of the chapter is organized as follows. 
Section 2 gives an overview of related hierarchi- 
cal clustering algorithms. Section 3 presents the 
hierarchical clustering algorithm. Section 4 gives 
some experimental results. Section 5 draws con- 
clusions and directions for future work. 


RELATED WORK 

In this section, some existing hierarchical
clustering algorithms are briefly described.
The basic process of hierarchical clustering is
presented, and previous hierarchical clustering
algorithms (ROCK, CURE, and CHAMELEON)
are discussed.

Johnson (1967) defined the basic process of
hierarchical clustering. Given a set of N items to
be clustered, and an N × N distance (or similarity)
matrix, the basic process of hierarchical clustering
is described as follows:

1. Start by assigning each item to a cluster,
so that if you have N items, you now have
N clusters, each containing just one item.
Let the distances (similarities) between the
clusters be the same as the distances
(similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters
and merge them into a single cluster, so that
now you have one cluster less.

3. Compute distances (similarities) between the
new cluster and each of the old clusters.

4. Repeat Steps 2 and 3 until all items are
clustered into a single cluster of size N.
There is no point in having all the N items
grouped in a single cluster, but once you
have the complete hierarchical tree, if
you want k clusters you just have to cut the
k-1 longest links.

Step 3 can be done in different ways (Johnson,
1967), which is what distinguishes single-linkage
from complete-linkage and average-linkage
clustering.

In single-linkage clustering, also called the 
connectedness or minimum method, the distance 
between one cluster and another cluster is con- 
sidered to be equal to the shortest distance from 
any member of one cluster to any member of the 
other cluster. If the data consist of similarities, 
the similarity between one cluster and another 
cluster is considered to be equal to the greatest 
similarity from any member of one cluster to any 
member of the other cluster. 

In complete-linkage clustering, also called 
the diameter or maximum method, the distance 
between one cluster and another cluster is con- 
sidered to be equal to the greatest distance from 
any member of one cluster to any member of the 
other cluster. 

In average-linkage clustering, the distance 
between one cluster and another cluster is con- 
sidered to be equal to the average distance from 
any member of one cluster to any member of the 
other cluster. 
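
The three linkage rules and the merge loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation taken from the cited works: the function names `linkage_distance` and `agglomerate` and the toy matrix `D` are hypothetical, and cluster distances are simply recomputed from the members at every step instead of being updated incrementally.

```python
from itertools import combinations
from statistics import mean


def linkage_distance(a, b, dist, method="single"):
    """Distance between clusters a and b (lists of item indices)."""
    pairwise = [dist[i][j] for i in a for j in b]
    if method == "single":      # shortest member-to-member distance
        return min(pairwise)
    if method == "complete":    # greatest member-to-member distance
        return max(pairwise)
    if method == "average":     # mean member-to-member distance
        return mean(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(dist, method="single"):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(dist))]   # Step 1: singletons
    history = []
    while len(clusters) > 1:
        # Step 2: find the closest pair under the chosen linkage.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], dist, method),
        )
        history.append((a, b, linkage_distance(a, b, dist, method)))
        # Step 3 is implicit: distances are recomputed from the members.
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a + b)
    return history


# Toy 4-item distance matrix (illustrative values only).
D = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
single = agglomerate(D, "single")
complete = agglomerate(D, "complete")
```

On this toy matrix all three rules merge the same pairs, but the height of the final merge differs (5 for single, 8 for complete, 6.5 for average linkage), which is the kind of difference that changes where a dendrogram would be cut.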

A variation on average-link clustering is the
UCLUS method of D’Andrade (1978), which uses
the median distance and is therefore much
more outlier-proof than the average distance.

This kind of hierarchical clustering is called ag- 
glomerative because it merges clusters iteratively. 
There is also a divisive hierarchical clustering that 
does the reverse by starting with all objects in one 
cluster and subdividing them into smaller pieces. 
Divisive methods are not generally available, and 
rarely have been applied. 

Agglomerative hierarchical clustering is a 
bottom-up clustering method where clusters have 
subclusters, which in turn have subclusters, and 
so forth. The classic example of this is species 
taxonomy. 

Agglomerative hierarchical clustering (HAC)
starts with one datum per cluster (singleton), 
then recursively merges the two clusters with 
the smallest distance between them into a larger 
cluster until only one cluster is left (Jain & Dubes, 
1998). The hierarchy within the final cluster has 
the following properties: 

• Clusters generated in early stages are nested 
in those generated in later stages. 

• Clusters with different sizes in the tree can 
be valuable for discovery. 

The advantages of agglomerative clustering 
are: 

• It can produce an ordering of the objects, 
which may be informative for data dis- 
play. 

• Smaller clusters are generated, which may 
be helpful for discovery. 

The disadvantage of this method is that no
provision can be made for the relocation
of objects that may have been “incorrectly”
grouped at an early stage. The result should be
examined closely to ensure it makes sense.

Use of different distance metrics for measuring
distances between clusters may generate different
results. Performing multiple experiments and
comparing the results is recommended to support
the veracity of the original results.

CURE and ROCK are clustering algorithms 
that belong to the class of agglomerative hierar- 
chical clustering algorithms (Guha, Rastogi, & 
Shim, 1998). 

CURE (Clustering Using REpresentatives) 
measures the similarity of two clusters based on 
the similarity of the closest pair of representative 
points belonging to different clusters, without 
considering the internal closeness of two clusters 
involved (Guha et al., 1998). 

ROCK (RObust Clustering using linKs) measures
the similarity of two clusters by comparing
the aggregate interconnectivity of two clusters
against a user-specified static interconnectivity
model, and thus ignores the potential variations
in the interconnectivity of different clusters
within the same dataset. Most of these algorithms
break down when the data consist of clusters of
diverse shapes, densities, and sizes, or contain
noise and artifacts (Guha et al., 2000).

CHAMELEON is another hierarchical clus- 
tering algorithm that measures the similarity of 
two clusters based on a dynamic model. In the 
clustering process, two clusters are merged only 
if the interconnectivity and closeness (proxim- 
ity) between two clusters are comparable to the 
internal interconnectivity of the clusters and 
closeness of items within the clusters (Karypis, 
Han, & Kumar, 1999). 

Divisive hierarchical clustering is a top-down 
clustering method and is less commonly used. It 
works in a similar way to agglomerative clustering, 
but in the opposite direction. This method starts 
with a single cluster containing all objects, and 
then successively splits resulting clusters until 
only clusters of individual objects remain. 

In Dumitrescu, Lazzerini, and Hui (2000), 
hierarchical data structure detection using evolu- 
tionary algorithms is proposed. A linear chromo- 
some representation is used. 


EVOLUTIONARY HIERARCHICAL 
CLUSTERING 

In this section, the evolutionary hierarchical
clustering (EvHiCA) algorithm is described.
EvHiCA uses a linear chromosome for solution
representation.

Solution Representation 

Let $X = \{x_1, x_2, \ldots, x_p\}$ be a data set and $d$ a distance
on X. To apply an evolutionary algorithm
to hierarchical classification problems, a
hierarchical clustering scheme is described. Each
individual (candidate solution or chromosome)
describes a hierarchy. A linear sequence is used
for representing the cluster hierarchy. The
sequence is translated into a binary classification
tree. Each node of the tree represents a potential
cluster (a class).

Classes of the classification tree are labeled
by parsing each level from left to right and top
to bottom. The root node is considered as a class
containing the entire data set X. The root node
has label 0.

Data points are assigned to the terminal classes
only. A class is called terminal if it belongs to the
frontier of the binary tree. Unless otherwise stated, a
class that has no descendent is a terminal class.
Classes corresponding to nonterminal nodes are
called nonterminal classes.

Solutions are represented by trees having a
variable number of levels (variable depth). For a
given problem, the maximum tree depth is specified
by the parameter $h_{max}$. Hence, the binary tree
that describes the cluster hierarchy has $h_{max} + 1$
levels, labeled from 0 to $h_{max}$. The level that
contains only the class that represents the entire
data set X is labeled by 0.

Each level h of the proposed binary classification
tree contains a number of nodes, denoted by
levnod and given by:

$levnod = 2^{h}$    (3.1)

The maximum number of binary tree nodes
(classes) different from the root node is Nrnodes
and is given by,

$Nrnodes = 2^{h_{max}+1} - 2$    (3.2)

It is not necessary to occupy every class of this
binary tree. Consequently, Nrnodes does not represent
the number of classes of a proposed hierarchy. The
number of classes may be less than Nrnodes, but
not less than 2. The number of classes does not
include the root node, which is labeled by 0 and
contains the entire classification data set X.
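
The labeling rule and the two counts (3.1) and (3.2) translate directly into index arithmetic. The sketch below assumes a full binary tree numbered level by level from the root (label 0); the helper names, and the child and parent formulas that follow from that numbering, are illustrative and do not appear in the chapter.

```python
def nodes_on_level(h):
    """Number of nodes on level h of a full binary tree, as in (3.1)."""
    return 2 ** h


def max_nonroot_nodes(h_max):
    """Maximum number of classes other than the root, as in (3.2)."""
    return 2 ** (h_max + 1) - 2


def level_labels(h):
    """Labels on level h when nodes are numbered level by level from 0."""
    first = 2 ** h - 1
    return range(first, first + nodes_on_level(h))


def children(label):
    """Labels of the two descendants of a node under this numbering."""
    return 2 * label + 1, 2 * label + 2


def parent(label):
    """Label of the parent of a non-root node."""
    return (label - 1) // 2
```

For a maximum depth of 2, level 2 carries the labels 3 to 6 and there are at most 6 non-root classes; for a maximum depth of 3, the terminal level carries the labels 7 to 14 and there are at most 14 non-root classes.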

The first step when applying an evolutionary
algorithm to solve a problem is to find a suitable
representation for candidate solutions. To
apply an evolutionary algorithm to a
hierarchical classification problem, a hierarchical
representation scheme is built. Each candidate
solution (individual) describes a clustering
hierarchy. The representation indicates how data points
are assigned to classes (tree nodes).

The individual length is equal to the cardinality
of the classification data set, denoted by p.

An individual is represented as a vector:

$c = (c_1, c_2, \ldots, c_p)$    (3.3)

where $c_j$ is an integer number,

$2^{h_{max}} - 1 \le c_j \le 2^{h_{max}+1} - 2$    (3.4)

The value $c_j$ of the gene j indicates the tree node to
which the point $x_j$ is assigned. $x_j$ is assigned only
to a terminal node (class). In this case, terminal
nodes are all the nodes on level $h_{max}$ of the binary
tree. The other nodes, called nonterminal nodes,
are each formed from their two descendent classes,
by parsing each level less than $h_{max}$
from left to right and top to bottom.

Each node in the tree corresponds to a potential
cluster. To ensure that the proposed hierarchical
representation of X is a binary, well-balanced
tree, the genes of the individual are constrained
as follows:

If p is an odd number, then at least k = (p-1)/2
genes of the chromosome are given by,

$2^{h_{max}} - 1 \le c_j < 3 \cdot 2^{h_{max}-1} - 1$,    (3.5)

and the others are given by,

$3 \cdot 2^{h_{max}-1} - 1 \le c_j \le 2^{h_{max}+1} - 2$,    (3.6)

If p is an even number, then at least k = p/2 genes
of the chromosome are given by (3.5), and the
others are given by (3.6).
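
A small checker makes the constraints (3.4) to (3.6) concrete. It is a sketch under two assumptions that the text leaves open: the range in (3.5) is read as the terminal nodes below node 1 and the range in (3.6) as those below node 2, so that the two ranges do not overlap, and "at least k genes" is checked literally. The names `leaf_ranges` and `is_balanced` are hypothetical.

```python
def leaf_ranges(h_max):
    """Return (left, right) label ranges for the terminal level h_max.

    `left` covers the terminal nodes under node 1, `right` those under node 2.
    """
    first = 2 ** h_max - 1             # lower bound in (3.4)
    middle = 3 * 2 ** (h_max - 1) - 1  # boundary used in (3.5) and (3.6)
    last = 2 ** (h_max + 1) - 2        # upper bound in (3.4)
    return range(first, middle), range(middle, last + 1)


def is_balanced(c, h_max):
    """Check that every gene is a terminal label and that at least k genes
    fall in the left range, with k = (p - 1) / 2 or p / 2."""
    left, right = leaf_ranges(h_max)
    n_left = sum(gene in left for gene in c)
    n_right = sum(gene in right for gene in c)
    if n_left + n_right != len(c):     # some gene violates (3.4)
        return False
    k = len(c) // 2                    # integer division covers odd and even p
    return n_left >= k
```

With a maximum depth of 2, the left range is {3, 4} and the right range is {5, 6}, so an individual of length 5 needs at least two genes equal to 3 or 4.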

In order to realize a feasible hierarchical 
representation scheme, a class that has a single 
descendent that contains points from data set will 
be a terminal class, and the descendent class will 
be eliminated from the hierarchical structure. 

Example of Proposed 
Representation 

Two examples of the proposed representation are
listed next.

Example 1 

Consider data set $X = \{x_1, x_2, x_3, x_4, x_5\}$ and $h_{max} = 2$.
Thus we may have binary trees with at most
$2^3 - 2 = 6$ nodes. Consider the individual,

c = (3, 4, 4, 5, 6). 

This individual describes the hierarchy de- 
picted in Figure 1. 

Example 2 

Let us consider data set $X = \{x_1, x_2, \ldots, x_9, x_{10}\}$
and $h_{max} = 3$. The maximum number of binary tree
nodes is 14. Consider the individual,

c = (7, 8, 9, 10, 10, 11, 12, 13, 13, 14).

Figure 1. Tree encoding of the individual c = (3, 4, 4, 5, 6)


Figure 2. Tree encoding of the individual c = (7,8,9,10,10,11,12,13,13,14) 



This individual describes the hierarchy de- 
picted in Figure 2. 
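
The translation from an individual to a hierarchy can be sketched as follows. The code assumes the level-by-level labeling with root 0, so that the parent of node i is (i - 1) // 2; the function names `decode` and `prune` are illustrative, and `prune` is one possible reading of the rule that a class with a single occupied descendent becomes terminal.

```python
from collections import defaultdict


def decode(c):
    """Map every occupied tree node to the sorted list of point indices it holds.

    Gene j (1-based) gives the terminal node of point x_j; each ancestor
    class is the union of its descendents, up to the root (label 0).
    """
    classes = defaultdict(list)
    for j, node in enumerate(c, start=1):
        while True:
            classes[node].append(j)
            if node == 0:
                break
            node = (node - 1) // 2   # move to the parent class
    return dict(classes)


def prune(classes):
    """Drop any class that holds the same points as its parent, i.e. the
    only occupied descendent of a class, so that the parent becomes terminal."""
    return {
        node: pts
        for node, pts in classes.items()
        if node == 0 or classes[(node - 1) // 2] != pts
    }


example_1 = decode((3, 4, 4, 5, 6))
example_2 = prune(decode((7, 8, 9, 10, 10, 11, 12, 13, 13, 14)))
```

For the individual of Example 1, this gives class 1 = {x_1, x_2, x_3} with terminal classes 3 = {x_1} and 4 = {x_2, x_3}, and class 2 = {x_4, x_5} with terminal classes 5 = {x_4} and 6 = {x_5}.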

Fitness Function 

The chromosomes have to be evaluated for com- 
parison. The evaluation is done by means of a 
quality (or fitness) function. 

The data set $X = \{x_1, x_2, \ldots, x_p\}$ is considered. The
aim is to find the hierarchy that represents the
best cluster structure of X. For this purpose, a
fitness function measuring the hierarchy quality is
needed. The fitness function must take its best value
for the hierarchy that is the most representative
of the input data set.

The fitness function is used as a basis for select- 
ing solutions (individuals) for recombination. 

The clustering criterion is based on minimising
the sum of squared distances of objects
from their cluster (class) centres (prototypes),
divided by the cardinality of each cluster (class).
For this purpose, Euclidean distance is used. The
optimal cluster hierarchy of data set X corresponds
to the minimal fitness function value.
The prototype of a class $A_i$ is given by,

$v_i = \frac{\sum_{x_k \in A_i} x_k}{p_i}$    (3.7)

where $p_i$ is the cardinality of $A_i$.

Let n be the number of classes. The fitness
function listed next is used.

$f = \sum_{i=1}^{n} \sum_{x_k \in A_i} \frac{d^2(x_k, v_i)}{p_i}$    (3.8)

The fitness function is minimised.
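
A direct evaluation of the prototype (3.7) and the fitness (3.8) might look as follows. This is a sketch, not the author's code: the function names are hypothetical, points are plain coordinate tuples, and the caller decides which classes of the hierarchy are passed in, since the text does not say whether nonterminal classes are counted among the n classes.

```python
def prototype(points):
    """Centre of a class: the coordinate-wise mean of its points, as in (3.7)."""
    p = len(points)
    return tuple(sum(coords) / p for coords in zip(*points))


def squared_distance(x, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def fitness(classes):
    """Value of (3.8) for a collection of non-empty classes.

    `classes` is an iterable of classes, each a list of coordinate tuples.
    """
    total = 0.0
    for points in classes:
        v = prototype(points)
        total += sum(squared_distance(x, v) for x in points) / len(points)
    return total


# Two ways of splitting the same four points (illustrative values only).
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
mixed = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
```

Here `fitness(tight)` is 2.0 and `fitness(mixed)` is 50.0, so the split that keeps nearby points together receives the smaller, and therefore better, value.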

Evolutionary Hierarchical Clustering 
Algorithm (EvHiCA) 

Using the proposed solution representation, an 
evolutionary algorithm is used to evolve a popu- 
lation of clustering hierarchies. Search operators 
used are crossover and mutation. 

One of the issues in evolutionary algorithms is 
the relative importance of two search operators: 
mutation and crossover. Genetic algorithms and 
genetic programming stress the role of crossover, 
while evolutionary programming and evolution 
strategies stress the role of mutation. The exis- 
tence of many different forms of crossover further 
complicates the issue. Despite theoretical analysis, 
it appears difficult to decide a priori which form 
of crossover to use, or even if crossover should be 
used at all. One possible solution to this difficulty 
is to have the EA be self-adaptive, that is, to have 
the evolutionary algorithm dynamically modify 
which forms of crossover to use and how often 
to use them, as it solves a problem (Spears, 1995; 
Spears & De Jong, 1991; Syswerda, 1989). 

EvHiCA uses a uniform crossover operator.
Uniform crossover does not use predefined
crossover points. For each gene of an offspring,
a global parameter indicates the probability with
which this gene comes from the first or from the
second parent. Each position of an offspring is
calculated separately (Back, Fogel, & Michalewicz,
1997; Dumitrescu, Lazzerini, Jain, &
Dumitrescu, 1999).
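
Uniform crossover as described above fits in a few lines. In this sketch the name `uniform_crossover`, the parameter `p_first` (the global probability of taking a gene from the first parent), and its default of 0.5 are assumptions; the chapter does not state the value used.

```python
import random


def uniform_crossover(parent_a, parent_b, p_first=0.5, rng=random):
    """Build one offspring gene by gene.

    Each gene is copied from parent_a with probability p_first and from
    parent_b otherwise; no crossover points are fixed in advance.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return tuple(
        a if rng.random() < p_first else b
        for a, b in zip(parent_a, parent_b)
    )


rng = random.Random(0)
child = uniform_crossover((3, 4, 4, 5, 6), (6, 5, 3, 3, 4), rng=rng)
```

Because every gene of the offspring is a terminal label taken from one of the parents, the offspring always satisfies (3.4); whether it also satisfies the balance conditions (3.5) and (3.6) has to be checked separately.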

The mutation probability is a parameter
of our evolutionary algorithm. Let $p_m$ be the
mutation rate. For each gene of each chromosome in the
population, a uniform random number q is generated.
If, for the i-th gene, the condition $q < p_m$ is
fulfilled, that gene is selected for mutation.
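
The selection of genes for mutation can be sketched as below. The text does not say how a selected gene is changed, so the replacement rule used here (a new terminal label drawn uniformly from the range allowed by (3.4)) is an assumption, as is the function name `mutate`.

```python
import random


def mutate(c, h_max, p_m, rng=random):
    """Return a mutated copy of chromosome c.

    A gene is selected when a uniform random number q satisfies q < p_m.
    Assumption: a selected gene is replaced by a terminal label drawn
    uniformly from the range allowed by (3.4).
    """
    low = 2 ** h_max - 1
    high = 2 ** (h_max + 1) - 2
    return tuple(
        rng.randint(low, high) if rng.random() < p_m else gene
        for gene in c
    )
```

With p_m = 0 the chromosome is returned unchanged, and with p_m = 1 every gene is redrawn, which brackets the mutation probabilities listed in the parameter tables.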

Binary tournament selection is considered.
Binary tournament selection implies that two
individuals directly compete for selection. The
tournament selection used is without reinsertion
of the competing individuals into the original
population.
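
One way to read "without reinsertion" is that competitors are drawn without replacement, so an individual takes part in at most one tournament per selection round. The sketch below follows that reading; the function name and signature are hypothetical.

```python
import random


def binary_tournaments(population, fitness, n_winners, rng=random):
    """Select n_winners individuals by binary tournaments.

    Competitors are drawn without replacement, so an individual that has
    competed is not reinserted into the pool. Lower fitness wins because
    the fitness function is minimised.
    """
    if 2 * n_winners > len(population):
        raise ValueError("not enough individuals for the requested tournaments")
    pool = rng.sample(population, 2 * n_winners)   # distinct competitors
    return [
        min(pool[i], pool[i + 1], key=fitness)
        for i in range(0, len(pool), 2)
    ]
```

Calling `binary_tournaments(population, fitness, 2)` yields the two parents needed for one recombination step.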

The evolutionary hierarchical clustering algorithm
(EvHiCA) starts with a randomly chosen population
of individuals. The number of individuals is
chosen first. In the second step, the maximum
number of levels for the hierarchy is selected. The
following steps are repeated until a termination
condition is reached. Two parents are chosen at
each step using binary tournament selection. The
selected individuals are recombined with a fixed
crossover probability $p_c$. One offspring is obtained
by recombining the two parents. The offspring is
mutated, and the best of them replaces the worst
individual in the population.

The algorithm keeps the best individual obtained
up to each generation t. The problem solution
is the best of the best individuals of all
generations. The best individual
is the individual that has the minimum value of
the fitness function (Back et al., 1997).
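
Putting the steps together, the loop described above might be sketched as follows. Several details are assumptions rather than statements from the text: the termination condition is a fixed number of generations, "the best of them" is read as the better of the offspring and its mutated version, a mutated gene is redrawn uniformly from the terminal labels, and the balance conditions (3.5) and (3.6) are not enforced. The toy fitness and data values are illustrative only.

```python
import random


def evolve(fitness, p, h_max, pop_size=20, generations=200,
           p_c=0.9, p_m=0.1, seed=0):
    """Steady-state loop following the description of EvHiCA.

    `fitness` maps a chromosome (tuple of terminal labels) to a value to
    be minimised. Returns the best chromosome seen and its fitness.
    """
    rng = random.Random(seed)
    low, high = 2 ** h_max - 1, 2 ** (h_max + 1) - 2

    def random_gene():
        return rng.randint(low, high)

    population = [tuple(random_gene() for _ in range(p))
                  for _ in range(pop_size)]
    best = min(population, key=fitness)

    for _ in range(generations):
        # Binary tournaments: four distinct competitors, two winners.
        a, b, c, d = rng.sample(population, 4)
        parent_1 = min(a, b, key=fitness)
        parent_2 = min(c, d, key=fitness)

        # Uniform crossover with probability p_c, else copy the first parent.
        if rng.random() < p_c:
            child = tuple(g1 if rng.random() < 0.5 else g2
                          for g1, g2 in zip(parent_1, parent_2))
        else:
            child = parent_1

        # Gene-wise mutation with rate p_m.
        mutant = tuple(random_gene() if rng.random() < p_m else g
                       for g in child)

        # Assumption: "the best of them" is the better of child and mutant.
        candidate = min(child, mutant, key=fitness)
        worst = max(range(pop_size), key=lambda i: fitness(population[i]))
        population[worst] = candidate

        if fitness(candidate) < fitness(best):
            best = candidate
    return best, fitness(best)


def toy_fitness(c, data=(0.0, 0.1, 0.2, 5.0, 5.1, 5.2)):
    """Sum over occupied terminal classes of squared deviations / class size."""
    groups = {}
    for label, x in zip(c, data):
        groups.setdefault(label, []).append(x)
    total = 0.0
    for xs in groups.values():
        centre = sum(xs) / len(xs)
        total += sum((x - centre) ** 2 for x in xs) / len(xs)
    return total


best, value = evolve(toy_fitness, p=6, h_max=2)
```

Replacing the worst individual at every step keeps the population size constant, and tracking `best` separately guarantees that the returned solution is never worse than any individual seen during the run.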

NUMERICAL EXPERIMENTS 

In this section, experimental evaluation of
EvHiCA is presented.

Data Sets

EvHiCA hierarchical clustering performance is
evaluated on two different groups of data. The
data sets are described as follows:

1. 2-D synthetic data: This group of two-dimensional
synthetic data sets is designed
to exhibit features of hierarchical data
structure.

2. Real data: Four real data sets from the machine
learning repository (Blake & Merz, 1998)
are used, which are summarized in Table 1.
These data are not necessarily designed
for unsupervised (clustering) methods, but
we include them because they allow
us to be more confident about any general
conclusions regarding our method.

Algorithm Parameters 

The parameters of the EvHiCA algorithm are given
in Tables 2 and 3.


Table 1. Data sets

Data set                   No. of records   No. of clusters   No. of features
Iris                       150              3                 4
Wine                       178              3                 13
Dermatology                366              6                 34
Breast Cancer Wisconsin    699              2                 10


Table 2. EvHiCA algorithm parameters for solving hierarchical clustering problems

Parameter                  Value
Population size            According to dataset used
Chromosome length          The cardinality of dataset
Crossover probability      0.9
Crossover type             Uniform crossover
Mutation probability       0.5, 0.1, 0.8


Table 3. EvHiCA algorithm parameters for solving hierarchical clustering problems with a large dataset

Parameter                  Value
Population size            250
Chromosome length          The cardinality of dataset
Crossover probability      0.9
Crossover type             Uniform crossover
Mutation probability       0.9


Experimental Results

EvHiCA is a hierarchical clustering algorithm;
this means that it gives the number of clusters
and the points in each cluster, and also produces
a dendrogram of possible clustering solutions at
different levels of granularity.

For the data sets used in the 2-D space, the
accuracy of the solution is 90%. This means that
the clustering is correct in most cases, but
sometimes the detected hierarchy is incorrect. Generally, the
number of misclassified points increases with the
number of points and decreases with the number
of generations.

For the synthetic dataset presented in Figure
3, EvHiCA gives the dendrogram shown in Figure 4.

For this kind of data set, the hierarchy is cor- 
rectly established and the data points are correctly 
classified. 

The results of the real datasets are discussed 
as follows. 


Figure 3. Dataset 1 in the 2-D space 



Figure 4. Dendrogram for the dataset in Figure 3


The Iris dataset is a classical dataset used in
discrimination tasks (Fisher, 1936). For the Iris data set,
a perfect clustering would place 50 cases in each of the
three clusters, with all cases of the same species in the
same cluster. Using our algorithm, the results are as
follows: for Setosa, all 50 cases are placed in the same
cluster; for Virginica, 42 cases are placed in the cluster
for this species and 8 are misclassified; for
Versicolor, 23 cases are correctly classified,
but 27 are misclassified as Virginica.

It is very important to establish the maximum
level, which is a parameter of the proposed
algorithm, according to the number of points in the
classification data set. With a suitable maximum
level the algorithm is faster, and the points are
correctly classified after a large number of generations.

The Wine data represent 13 different chemical
constituents of 178 Italian wines derived from
three different cultivars. The data of the Wine
set are normalized to facilitate the calculation of
Euclidean distance. The Wine dataset consists of three
clusters and 178 data points, and each data point has
13 attributes. The hierarchical clustering result is
displayed as a dendrogram. The data that belong
to the same cluster will be in the same subtree.
Figure 5 presents the clustering results of the
application of EvHiCA to the Wine data. The labels
A, B, and C represent the three kinds of wine. Figure
5 shows that 2 class “A” Wine data, 14 class “B”
Wine data, and 4 class “C” Wine data
have been misclustered in the hierarchical tree
resulting from our clustering algorithm.

Experimental results with the Dermatology dataset
and the Breast Cancer Wisconsin dataset (Wolberg
& Mangasarian, 1990) were also analyzed. The
Breast Cancer Wisconsin dataset contains 699
instances with 10 attributes. Each record belongs
to one of two classes, benign or malignant: 458
records are benign and 241 records are malignant.

Figure 5. The EvHiCA hierarchical clustering result of Wine data set

The clustering result is displayed as a
dendrogram. The clustering results with EvHiCA
show that 106 (23.14%) of the benign records and
37 (15.35%) of the malignant records are
misclustered.

The differential diagnosis of erythemato-squamous
diseases is a real problem in dermatology. The
disease types share the clinical features of
erythema and scaling, with very few differences
(Wolberg & Mangasarian, 1990). The dataset
contains 34 attributes, of which 33 are linear and
one is nominal. It consists of 366 records,
classified into six classes of diseases. The
numbers of records belonging to each class are
112, 61, 72, 49, 52, and 20, respectively.
Distribution of datasets among classes is
proportional, and thus all the classes are
subdivided into four different subsets.

After 150 generations, EvHiCA finds these four
subsets correctly. Only 5-10% of each class is
incorrectly classified.

Solution accuracy generally increases with
the number of generations. The algorithm usually
detects structures that represent well-balanced
binary hierarchies.


CONCLUSION AND FUTURE 
RESEARCH 

In this chapter, a new evolutionary technique for 
detecting hierarchical structure in a dataset was 
presented. 

The main contributions of this chapter are the
new representation of the hierarchical structure
and the fitness function.

Further research will explore different fitness
functions and restrictions in order to provide the
ability to handle large data collections, and will
try to adapt the genetic operators to improve
the results. A correspondence between the maximum
level of the proposed binary classification tree
and the cardinality of the dataset will be sought.
The maximum level will be encoded in the
chromosome representation.

Another direction for future work is to define
special genetic operators that improve the
hierarchical structure scheme.

ABSTRACT

In this chapter, we present genetic-algorithm (GA)-based methods developed for clustering univariate
time series with equal or unequal length as an exploratory step of data mining. These methods basically 
implement the k-medoids algorithm. Each chromosome encodes, in binary, the data objects serving as the 
k-medoids. To compare their performance, both fixed-parameter and adaptive GAs were used. We first 
employed the synthetic control-chart data set to investigate the performance of three fitness functions, 
two distance measures, and other GA parameters such as population size, crossover rate, and mutation 
rate. Two more sets of time series with or without a known number of clusters were also tested:
one is the cylinder-bell-funnel data and the other is the novel battle simulation data. The clustering 
results are presented and discussed. 


INTRODUCTION 

Before prediction models can be built in data min- 
ing or knowledge discovery, it is often advisable 
to first explore the data. Clustering is known to be 
a good exploratory data-mining tool. The goal of 
clustering is to create structure for unlabeled data 
by objectively forming data into homogeneous 
groups, where the within-group object similarity 
and the between-group object dissimilarity are
optimized. The bulk of clustering analyses has
been performed on data associated with static 
features, that is, feature values that do not change 
with time, or the changes are negligible. 

Two major classes of clustering methods are 
partitioning and hierarchical clustering. Well- 
known partitioning-based clustering methods 
include k-means (MacQueen, 1967), k-medoids
(Kaufman & Rousseeuw, 1990), and the
corresponding fuzzy versions: fuzzy c-means
(FCM) (Bezdek, 1987) and fuzzy c-medoids
(Krishnapuram, Joshi, Nasraoui, & Yi, 2001). 
Hierarchical clustering methods are either of the 
agglomerative type or the divisive type. Lately, 
soft computing technologies, including neural 
networks and genetic algorithms, have emerged 
as another class of clustering techniques. Two 
prominent methods of the neural network approach 
to clustering are competitive learning and self- 
organizing feature maps. Most genetic clustering 
methods implement the spirit of partitioning meth- 
ods, especially the k-means algorithm (Krishna & 
Murty, 1999; Maulik & Bandyopadhyay, 2000), 
and the fuzzy c-means algorithm (Hall, Ozyurt, 
& Bezdek, 1999). 

Just like static feature data, forming groups of 
similar time series given a set of unlabeled time 
series is often desirable. These unlabeled time 
series could be monitoring data collected during
different periods from a particular process, or from
several different processes. These processes could
be natural, engineered, business, economic, or
medical in nature. Unlike static feature values, the
time series of a feature consists of dynamic values,
that is, values that change with time. This greatly
increases the dimensionality of the problem and 
calls for somewhat different, and often more 
complicated, clustering methods. This study will 
focus only on time-series data. 

In surveying work related to time-series 
clustering, Liao (2005) distinguished three dif- 
ferent time-series clustering approaches: those 
working with full data either in the time or fre- 
quency domain, those working with extracted 
features, and model-based approaches with 
models built from the raw data. An example of 
the first approach is Golay et al. (Golay, Kollias, 
Stoll, Meier, Valavanis, & Boesiger, 1998). They 
applied the fuzzy c-means algorithm to provide 
the functional maps of human brain activity 
on the application of a stimulus. In their study, 
three different distances (the Euclidean distance 
and two cross-correlation-based distances) were
alternately used for comparison purposes. Goutte
et al. (Goutte, Toft, & Rostrup, 1999) and Fu et al.
(Fu, Chung, Ng, & Luk, 2001) took the
feature-based approach. Goutte et al. clustered
functional magnetic resonance imaging (fMRI) time
series in groups of voxels with similar activations
using two algorithms: k-means and Ward’s
hierarchical clustering. The cross-correlation
function, instead of the raw fMRI time series, was
used as the feature space. Fu et al. described the use
of self-organizing maps for grouping similar 
temporal patterns dispersed along the time series. 
Two enhancements were made: consolidating the 
discovered clusters by a redundancy removal step, 
and introducing the perceptually important point- 
identification method to reduce the dimension of 
the input data sequences. 

Three model-based time-series clustering 
methods are described next. Li and Biswas (1999) 
described a clustering methodology for temporal 
data using hidden Markov model representation 
with a sequence-to-model likelihood distance 
measure. The temporal data was assumed to 
have the Markov property. Time series were con- 
sidered similar when the models characterizing 
individual series were similar. Policker and Geva 
(2000) presented a model for nonstationary time 
series with time varying mixture of stationary 
sources, comparable to the continuous hidden 
Markov model. Fuzzy clustering methods were 
applied to estimate the continuous drift in the 
time-series distribution, and the resultant mem- 
bership matrix was given an interpretation as 
weights in a time varying, mixture probability 
distribution function. Kalpakis et al. (Kalpakis, 
Gada, & Puttagunta, 2001) studied the clustering 
of ARIMA time series by using the Euclidean
distance between the Linear Predictive Coding
Cepstra of two time series as their dissimilarity
measure and the Partitioning Around Medoids (PAM)
method as the clustering algorithm.

To the best of our knowledge, the only study 
that applied genetic algorithms to cluster time 
series is Baragona (2001). He evaluated three
metaheuristic methods to partition a set of time series
into clusters in such a way that (1) the maximum 
absolute cross-correlation value between each 
pair of time series that belong to the same cluster 
is greater than some given threshold, and (2) the 
k-min cluster criterion is minimized with a speci- 
fied number of clusters. The cross-correlations 
were computed from the residuals of the models 
of the original time series. Among all methods 
evaluated, Tabu search was found to perform 
better than single linkage, pure random search, 
simulated annealing, and genetic algorithms,
based on a simulation experiment on 10 sets of 
artificial time series generated from low-order 
univariate and vector ARMA models. For each 
time series, 300 observations were generated. 

This study proposes k-medoids-based genetic 
algorithms for clustering time-series data of equal 
or unequal length as an exploratory step of data 
mining. Though only a few selected time series were
tested in this study due to space constraints, the
methods proposed can be generalized
to time-series data in other domains. We chose
to directly process raw data to avoid the need 
either to extract features or to fit some appropri- 
ate models for the data at hand. Our GA differs 
from that of Baragona in three major aspects: (1)
working with the original data rather than the
residuals, (2) handling time series of unequal
length by using the dynamic time warping
distance rather than the cross-correlation
function, and (3) implementing different GAs. In the
next two sections, the proposed genetic cluster- 
ing method and the distance measures used are 
presented. The subsequent three sections present 
the test results of the synthetic control-chart data, 
cylinder-bell-funnel data, and battle simulation 
data, respectively. A discussion then follows and 
finally, the chapter is concluded. 


GENETIC CLUSTERING OF 
TIME-SERIES DATA 

Genetic algorithms have the following elements: 
population of chromosomes, selection according 
to fitness, crossover to produce new offspring, 
and random mutation of new offspring (Mitchell, 
1996). In this section, the proposed genetic algo- 
rithms for clustering time-series data are detailed 
element by element. 

In summary, four different chromosome rep- 
resentations have been employed by the genetic 
clustering techniques that implemented either 
the k-means or fuzzy c-means algorithm. They 
include integer-coded cluster for each datum with 
length equal to the number of data points, as in 
Krishna and Murty (1999); real-coded cluster 
centers, as in Maulik and Bandyopadhyay (2000) 
and Bandyopadhyay and Maulik (2002); binary- 
coded cluster centers (in gray coding), as in Hall 
et al. (1999); and binary-coded representation 
of p-median problems with length equal to
the number of data points, in which a digit value 
of “1” denotes a median and “0” not a median 
(Lorena & Furtado, 2000). These chromosome 
representation methods can be extended to time- 
series data, though they were initially designed 
for static feature data. 

We implemented the k-medoids algorithm 
rather than k-means (or fuzzy c-means) because 
of the difficulty involved in defining the cluster 
centers for time series with unequal length. Since 
the data dimension is relatively large in our 
application, either binary-coded or real-coded 
cluster centers are inappropriate. Therefore, we 
elected to use a binary-coded representation of 
data objects serving as the cluster medoids. (The 
other alternative is integer-coded representation, 
which we will investigate, and hope to share the 
results once they become available). The prespeci- 
fied number of cluster medoids and the number 
of digits used to represent each medoid together
determine the chromosome length.
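
As an illustration of how such a chromosome could be read, the
following minimal sketch decodes a bit string into medoid indices.
It assumes that each medoid is stored as the plain binary index of
a data object and that an index beyond the last data object marks
an invalid medoid; the chapter does not spell out either detail,
and the names `bits_per_medoid` and `decode` are hypothetical.

```python
def bits_per_medoid(n_objects):
    """Binary digits needed to index n_objects data objects (0 .. n_objects - 1)."""
    return max(1, (n_objects - 1).bit_length())


def decode(chromosome, k, n_objects):
    """Decode a list of 0/1 bits into k medoid indices.

    Returns None when any gene points past the last data object, which this
    sketch treats as an invalid medoid (an assumption of the example).
    """
    width = bits_per_medoid(n_objects)
    if len(chromosome) != k * width:
        raise ValueError("chromosome length must equal k * bits per medoid")
    medoids = []
    for g in range(k):
        gene = chromosome[g * width:(g + 1) * width]
        index = int("".join(map(str, gene)), 2)
        if index >= n_objects:
            return None
        medoids.append(index)
    return medoids


# With 30 series and 6 medoids, each chromosome has 6 * 5 = 30 bits.
print(bits_per_medoid(30) * 6)
print(decode([0, 0, 0, 1, 1, 1, 1, 1, 0, 1], k=2, n_objects=30))
print(decode([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], k=2, n_objects=30))
```

Under these assumptions the chromosome length is the number of
medoids times the number of digits per medoid, as stated above.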

Each chromosome in the population is evalu- 
ated in two steps: first distributing each data point 
to the closest medoid according to some distance 
measure, and then computing the fitness value. 
The dynamic time warping distance, $d_{DTW}$, is
chosen because of its ability to handle time series 
with varying lengths. More details are presented 
in the next section. The cluster of each datum is 
determined to be the one medoid closest to it, 
based on the nearest neighbor concept. Three 
fitness functions, as given next, are evaluated in 
this study. 

1. $10^6 / (TWCV \times (1 + nbr0 \times \text{a large integer}))$,
where TWCV and nbr0 denote the total within-cluster
distance and number of clusters with zero members,
respectively. Krishna and Murty (1999) used the
total within-cluster distance for static feature
values but not time series. Our implementation
differs from theirs also in the distance measure
(dynamic time warping instead of Euclidean), the
clustering algorithm (k-medoids instead of
k-means), the chromosome representation, and other
GA details. Let $v_i$, $i = 1, \ldots, c$, be the
cluster medoids representing clusters $C_i$,
$i = 1, \ldots, c$, and $x_j$, $j = 1, \ldots, n$,
be the data vectors. If $x_j \in C_i$, then
$w_{ij} = 1$; else, $w_{ij} = 0$. In this study,
the TWCV is computed as:

$$TWCV = \sum_{i=1}^{c} WCV_i = \sum_{i=1}^{c} \sum_{j=1}^{n} w_{ij} \, d_{DTW}(v_i, x_j) \qquad (1)$$

2. $10^6 / (DB \times (1 + nbr0 \times \text{a large integer}))$,
where DB is the Davies-Bouldin index that was
initially proposed as a measure of the validity of
the clusters by Davies and Bouldin (1979) and later
used by Bandyopadhyay and Maulik (2002) in their
study of genetic clustering of satellite images.
Let $n_i$ denote the number of data points in
cluster $i$. For this study, the DB index is
modified as:

$$DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \ne i} \left\{ \frac{WCV_i / n_i + WCV_j / n_j}{d_{DTW}(v_i, v_j)} \right\} \qquad (2)$$


3. $10^6 / (R_m(V) \times (1 + nbr0 \times \text{a large integer}))$,
where $R_m(V)$ is the reformulated FCM functional
(Hall et al., 1999). In the following equation, we
replace the original Euclidean distance with the
DTW distance and $m$ is the fuzzy weight ($m > 1$).

$$R_m(V) = \sum_{j=1}^{n} \left( \sum_{i=1}^{c} d_{DTW}(v_i, x_j)^{1/(1-m)} \right)^{1-m} \qquad (3)$$



Note that each fitness function includes a
penalty term to discourage the formation
of clusters with zero members (“a large
integer” was consistently set at 10 in this
study). For comparing time-series data with
equal length, Euclidean distance, $d_E$, was
also implemented. In this case, $d_{DTW}$ in the
previous equations is replaced by $d_E$.
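
The three criteria and the shared penalty term can be sketched as
follows. This is a minimal illustration, not the implementation
used in the study. It assumes a precomputed pairwise distance
matrix `dist` (DTW or Euclidean), medoids given as indices of data
objects, a fuzzy weight of m = 2, and simple guards for degenerate
cases (empty clusters, coincident medoids, zero distances); all
function names and the guards are assumptions of the example.

```python
PENALTY = 10  # value reported in the text for "a large integer"


def cluster_stats(dist, medoids):
    """Assign each object to its closest medoid; return (WCV_i list, n_i list)."""
    c = len(medoids)
    wcv, sizes = [0.0] * c, [0] * c
    for j in range(len(dist)):
        i = min(range(c), key=lambda i: dist[medoids[i]][j])
        wcv[i] += dist[medoids[i]][j]
        sizes[i] += 1
    return wcv, sizes


def twcv(dist, medoids):
    """Equation (1): total within-cluster distance."""
    return sum(cluster_stats(dist, medoids)[0])


def db_index(dist, medoids):
    """Equation (2); empty clusters and coincident medoids are skipped (assumption)."""
    wcv, sizes = cluster_stats(dist, medoids)
    c = len(medoids)
    scatter = [wcv[i] / sizes[i] if sizes[i] else 0.0 for i in range(c)]
    total = 0.0
    for i in range(c):
        ratios = [(scatter[i] + scatter[j]) / dist[medoids[i]][medoids[j]]
                  for j in range(c)
                  if j != i and dist[medoids[i]][medoids[j]] > 0]
        total += max(ratios, default=0.0)
    return total / c


def rm_functional(dist, medoids, m=2.0):
    """Equation (3); a datum at zero distance from a medoid contributes 0 (limit value)."""
    total = 0.0
    for j in range(len(dist)):
        d = [dist[v][j] for v in medoids]
        if min(d) > 0:
            total += sum(x ** (1.0 / (1.0 - m)) for x in d) ** (1.0 - m)
    return total


def fitness(criterion, sizes):
    """10^6 / (criterion * (1 + nbr0 * large integer)); criterion must be positive."""
    nbr0 = sizes.count(0)
    return 1e6 / (criterion * (1 + nbr0 * PENALTY))


# Toy example: six 1-D points, absolute difference as the distance.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids = [1, 4]
_, sizes = cluster_stats(dist, medoids)
print(fitness(twcv(dist, medoids), sizes))
```

In this sketch a cluster is empty only when two genes decode to the
same medoid, since every medoid is its own nearest object; the
penalty term then reduces the fitness by a factor of 11 for each
such cluster.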

Standard roulette-wheel selection is used to 
reproduce offspring for the next generation. Each 
current chromosome in the population has a rou- 
lette-wheel slot, sized in proportion to its fitness. 
Chromosomes with a higher fitness value thus 
have a higher probability of contributing more 
offspring. Pairs of chromosomes are randomly 
chosen to perform the one-point crossover opera- 
tion according to the specified crossover rate. The 
mutation is then performed to flip some of the bits 
in a chromosome from “0” to “1” or “1” to “0,” 
according to the specified rate. If the resultant 
chromosome contains an invalid cluster medoid, 
its fitness is then set to zero to prevent it surviving 
to the next generation (a simple repair procedure 
to take care of the infeasible solutions). 
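
Roulette-wheel selection as described above can be sketched in a
few lines. The sketch assumes non-negative fitness values and falls
back to a uniform choice when every fitness is zero; that fallback
and the function name `roulette_select` are assumptions of the
example, not details given in the text.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Assumption: with no positive fitness, choose uniformly at random.
        return rng.choice(population)
    spin = rng.random() * total
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if spin < running:
            return chromosome
    return population[-1]  # guard against floating-point round-off


rng = random.Random(0)
population = ["a", "b", "c"]
fitnesses = [1.0, 0.0, 3.0]   # "b" stands for an invalid chromosome
draws = [roulette_select(population, fitnesses, rng) for _ in range(4000)]
print({name: draws.count(name) for name in population})
```

A chromosome whose fitness has been set to zero owns a slot of zero
width, so it is never drawn as a parent while any other chromosome
has positive fitness.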

The GA process has the following steps:

• Set the generation value to zero.

• Initialize the population.

• Evaluate the population.

• While the maximum number of generations
is not reached,

    • Select chromosomes.

    • Perform the crossover operation.

    • Perform the mutation operation.

    • Evaluate the new pool of chromosomes.

    • Increment the generation value.
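
The steps above can be sketched as a short fixed-parameter GA loop.
This is a generic illustration under stated assumptions: the
chromosomes are bit lists, the fitness is non-negative, the
population size is even, and the best chromosome seen so far is
remembered only for reporting (the text does not mention elitism).
The toy fitness used here simply counts the 1 bits and stands in
for the clustering fitness functions.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, p_c=0.8, p_m=0.01, g_max=30, seed=0):
    """Fixed-parameter GA over bit lists; pop_size is assumed even, n_bits >= 2."""
    rng = random.Random(seed)
    # Initialize and evaluate the population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    scores = [fitness(ch) for ch in population]
    best_score, best = max(zip(scores, population), key=lambda pair: pair[0])
    best = best[:]
    generation = 0
    while generation < g_max:
        # Select chromosomes in proportion to fitness (roulette wheel).
        weights = scores if sum(scores) > 0 else None
        parents = rng.choices(population, weights=weights, k=pop_size)
        # One-point crossover on consecutive pairs, applied with probability p_c.
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < p_c:
                cut = rng.randint(1, n_bits - 1)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            offspring += [a, b]
        # Bit-flip mutation with probability p_m per bit.
        for child in offspring:
            for i in range(n_bits):
                if rng.random() < p_m:
                    child[i] ^= 1
        # Evaluate the new pool and move to the next generation.
        population = offspring
        scores = [fitness(ch) for ch in population]
        gen_score, gen_best = max(zip(scores, population), key=lambda pair: pair[0])
        if gen_score > best_score:
            best_score, best = gen_score, gen_best[:]
        generation += 1
    return best_score, best


score, chromosome = run_ga(sum, n_bits=16)
print(score, chromosome)
```

The default arguments mirror the fixed settings reported later for
the control-chart experiments (population size 20, 30 generations,
crossover rate 0.8, mutation rate 0.01).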

Several GA parameters need to be set for the
GA process to run. They include the population
size, $s$, the crossover rate, $p_c$, the mutation
rate, $p_m$, and the maximum number of
generations, $g_{max}$.
In addition, a different selection strategy could 
be used. Proper selection of these parameters 
greatly affects the GA behavior that is strongly 
determined by the balance between exploration 
(to explore new and unknown areas in the search 
space) and exploitation (to make use of knowledge 
acquired by exploration to reach better positions 
in the search space). Most GA studies choose 
these parameters by trial and error, without a 
systematic investigation. 

Attempts to find the optimal and general set 
of parameters have been made by testing a wide 
range of problems (Grefenstette, 1986). To de- 
termine the relevancy and relative importance of 
these parameters, Rojas et al. (Rojas, Gonzalez, 
Pomares, Merelo, Castillo, & Romero, 2002) 
applied the analysis of variance (ANOVA)
technique. The response variables used to perform 
the statistical analysis were the maximum fitness 
in the last generation that measures the capacity 
to find a local/global optimum, and the average 
fitness in the last generation that measures the 
diversity in the population. In terms of the best 
solution, all variables were found significant, 
with the first three most significant ones being 
the selection operator, the population size, and 
the type of mutation. Regarding the diversity,
the significant variables in descending order
were the type of selection operators, the muta- 
tion rate, the mutation type, and the number of 
generations. These results were obtained based on 
their tests on a 0/1 knapsack problem, the Riolo 
function, the prisoner’s dilemma problem, and 
three Michalewicz’s functions. 

Another school of approaches for setting GA 
parameters is using some mechanism to adapt 
them depending upon the state of the GA learning 
process, instead of fixing them from the outset. 
Srinivas and Patnaik (1994) proposed the adap- 
tive genetic algorithm (AGA) for multimodal 
function optimization to realize the dual goals 
of maintaining diversity in the population and 
sustaining the convergence capacity of the GA. 
The AGA varies the probabilities of crossover and 
mutation depending upon the fitness values of the 
solutions. Let $f_{max}$ and $f_{avg}$ be the
maximum fitness and the average fitness of the
entire population, $f'$ be the larger of the
fitness values of the solutions to be crossed, and
$f$ be the fitness value of the solution to be
mutated. The expressions for $p_c$ and $p_m$ are
given as

$$p_c = \begin{cases} k_1 (f_{max} - f') / (f_{max} - f_{avg}), & f' \ge f_{avg} \\ k_3, & f' < f_{avg} \end{cases} \qquad (4)$$

and

$$p_m = \begin{cases} k_2 (f_{max} - f) / (f_{max} - f_{avg}), & f \ge f_{avg} \\ k_4, & f < f_{avg} \end{cases} \qquad (5)$$

where $k_1, k_2, k_3, k_4 \le 1.0$. They set $k_2$
and $k_4$ to 0.5 to ensure the disruption of those
solutions with average or subaverage fitness. To
force all solutions with a fitness value less than
or equal to the average fitness to undergo
crossover, they assigned $k_1$ and $k_3$ a value
of 1.
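
A minimal sketch of these adaptive rates is given below, using the
constants quoted above as defaults. The handling of a population in
which all fitness values are equal (so that the maximum equals the
average) is an assumption of the example, as are the function
names.

```python
def adaptive_pc(f_prime, f_max, f_avg, k1=1.0, k3=1.0):
    """Crossover probability for a pair whose larger fitness is f_prime."""
    if f_prime >= f_avg and f_max > f_avg:
        return k1 * (f_max - f_prime) / (f_max - f_avg)
    # Below-average pairs, or a fully converged population (assumption).
    return k3


def adaptive_pm(f, f_max, f_avg, k2=0.5, k4=0.5):
    """Mutation probability for a solution with fitness f."""
    if f >= f_avg and f_max > f_avg:
        return k2 * (f_max - f) / (f_max - f_avg)
    # Below-average solutions, or a fully converged population (assumption).
    return k4


f_max, f_avg = 10.0, 6.0
for f in (10.0, 8.0, 6.0, 4.0):
    print(f, adaptive_pc(f, f_max, f_avg), adaptive_pm(f, f_max, f_avg))
```

The best solution in the population receives rates of zero and is
therefore left intact, while solutions at or below the average
fitness receive the largest rates.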

Eiben et al. (Eiben, Hinterding, & Michalewicz,
1999) classified parameter control (or parameter
adaptation) studies based on two aspects:
how the mechanism works and what component
of the evolutionary algorithms (that includes GA)
is affected by the mechanism. They classified pa- 
rameter control mechanisms into three categories: 
deterministic in the sense that parameter values 
are altered by some deterministic rule, adaptive 
by using feedback from the search to determine 
the direction and/or magnitude of the change, 
and self-adaptive by encoding the parameters 
into the chromosomes that undergo mutation and 
recombination. They identified six components 
being adapted: representation, evaluation func- 
tion, mutation operators and their probabilities, 
crossover operators and their probabilities, par- 
ent selection, and replacement operator. Most 
previous parameter adaptation studies used one 
mechanism to adapt one or two components. For 
example, the work of Srinivas and Patnaik (1994) 
employed an adaptive mechanism to vary two 
components: mutation rate and crossover rate. 
To date, there are insufficient research studies 
and results to conclude how much parameter 
control is most useful. Herrera and Lozano (2003)
reviewed different aspects of fuzzy adaptive 
genetic algorithms (FAGA) from three points of 
view: design, taxonomy, and future directions. 
The steps for designing the fuzzy logic control- 
ler used by FAGAs were shown with an example. 
They categorized FAGAs based on two criteria: 
how the rule base is obtained, and the level where 
the adaptation takes place. They also discussed 
future directions and some challenges for FAGA 
research. 

The parameter adaptation approaches have be- 
come more popular than the selection approaches 
as one gradually realizes the difficulty in coming 
up with a general rule, and that different prob- 
lems really require different GA parameters for 
satisfactory performance. Therefore, this study 
will employ a parameter adaptation approach. 
Specifically, we employ the AGA proposed by 
Srinivas and Patnaik (1994) to adapt both the 
crossover rate and the mutation rate (or only the 
mutation rate) in order to evaluate its performance 
in clustering time-series data. Proposing a new
parameter adaptation method is beyond the scope
of this study. Nevertheless, both the results with 
adapted parameters and those with fixed param- 
eters are obtained and compared. 

SIMILARITY/DISTANCE MEASURES 

One key issue in clustering is how to measure 
the similarity between two data objects being 
compared. For static feature values, Euclidean
distance or generalized Minkowski distance is
often used. The Euclidean distance has been used
to measure the distance between two time series
of the same length, for example, by Pham and
Chan (1998). The Euclidean distance or generalized
Minkowski distance is applicable only when
the lengths of time-series data are equal, which
is not the case for the battle simulation data to be
studied in the sequel. Therefore, we resort to the
dynamic-time-warping distance, which is known to
be capable of coping with time series of unequal
length.

Dynamic time warping (DTW) is a generalization
of classical algorithms for comparing discrete
sequences to sequences of continuous values.
Given two time series, $Q = q_1, q_2, \ldots, q_i, \ldots, q_n$
and $C = c_1, c_2, \ldots, c_j, \ldots, c_m$, DTW
aligns the two series so that their difference is
minimized. To this end, an n by m distance matrix
is used in which the (i, j) element contains the
distance $d(q_i, c_j)$ between two points $q_i$
and $c_j$, which is often measured by the
Euclidean distance. A warping path,
$W = w_1, w_2, \ldots, w_k, \ldots, w_K$ where
$\max(n, m) \le K \le m + n - 1$, is a set
of matrix elements that satisfies three constraints:
boundary condition, continuity, and monotonicity.
The boundary condition constraint requires the
warping path to start and to finish in diagonally
opposite corner cells of the matrix. That is,
$w_1 = (1, 1)$ and $w_K = (n, m)$. The continuity
constraint restricts the allowable steps to
adjacent cells. The monotonicity constraint forces
the points in the warping path to be monotonically
spaced in time. Of interest is the warping path
that has the minimum distance between the two
series.
Mathematically,

$$d_{DTW} = \min_{W} \sum_{k=1}^{K} w_k \qquad (6)$$

Dynamic programming can be used to effec- 
tively find this path by evaluating the recurrence 
function given as equation (7), which defines the 
cumulative distance as the sum of the distance 
of the current element and the minimum of the 
cumulative distances of the adjacent elements: 

$$d_{cum}(i, j) = d(q_i, c_j) + \min\{ d_{cum}(i-1, j-1),\ d_{cum}(i-1, j),\ d_{cum}(i, j-1) \} \qquad (7)$$
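
The recurrence can be evaluated with a small dynamic-programming
table, as in the following sketch. It assumes univariate series,
uses the absolute difference as the point distance (the Euclidean
distance in one dimension), and returns the cumulative distance at
cell (n, m) without any path-length normalization; these choices
and the function name are assumptions of the example.

```python
def dtw_distance(q, c, d=lambda a, b: abs(a - b)):
    """Cumulative DTW distance between sequences q (length n) and c (length m)."""
    n, m = len(q), len(c)
    inf = float("inf")
    # cum[i][j] is the cumulative distance for q[:i] and c[:j];
    # row 0 and column 0 act as a border so that the path starts at (1, 1).
    cum = [[inf] * (m + 1) for _ in range(n + 1)]
    cum[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cum[i][j] = d(q[i - 1], c[j - 1]) + min(
                cum[i - 1][j - 1],  # diagonal step
                cum[i - 1][j],      # step along q only
                cum[i][j - 1],      # step along c only
            )
    return cum[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # series of unequal length
print(dtw_distance([1, 2, 3], [2, 3, 4]))
```

The table has (n + 1) by (m + 1) cells, so the cost grows with the
product of the two series lengths, which matters when every pair of
series must be compared.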

CLUSTERING RESULTS OF 
CONTROL-CHART DATA 

The genetic clustering methods were first applied
to clustering 30 synthetic control-chart time
series, taken from the UCI Data Mining Archive
(http://kdd.ics.uci.edu/), that were initially
generated by Pham and Chan (1998). Figure 1 shows
the 30 synthetic control-chart time series used in
this study. There are six known patterns (or
clusters): normal, cyclic, increasing trend,
decreasing trend, upward shift,
and downward shift. For each pattern, there are
five time series. It is not difficult to see that there 
is some overlapping between similar patterns (ad- 
jacent clusters), for example, between increasing 
trend and upward shift. We were able to compute 
the clustering accuracy rates for the control-chart 
data set because the ground truth is known. Just 
like any clustering algorithm, the proposed genetic 
clustering algorithm arbitrarily labels each cluster. 
This inconsistent labeling makes the accuracy- 
checking task somewhat tedious. 
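
One way to automate this check is to search for the one-to-one
relabeling of clusters that agrees best with the known patterns.
The sketch below does this by brute force, which is affordable for
six clusters (720 relabelings). The chapter does not state how the
accuracy rates were matched, so this procedure and the function
name are assumptions of the example.

```python
from itertools import permutations


def clustering_accuracy(truth, predicted, k):
    """Best agreement over all one-to-one mappings of cluster labels to classes.

    Both label lists are assumed to use the integers 0 .. k - 1.
    """
    counts = [[0] * k for _ in range(k)]   # counts[cluster][class]
    for t, p in zip(truth, predicted):
        counts[p][t] += 1
    best = max(
        sum(counts[cluster][mapping[cluster]] for cluster in range(k))
        for mapping in permutations(range(k))
    )
    return best / len(truth)


truth = [0, 0, 1, 1, 2, 2]
predicted = [2, 2, 0, 0, 1, 0]   # same grouping idea, arbitrary cluster names
print(clustering_accuracy(truth, predicted, k=3))
```

For larger numbers of clusters the brute-force search becomes
expensive, and an assignment solver would be a more practical
choice.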

Table 1 summarizes the clustering results of 
using fixed parameter GAs. The fixed GA pa- 
rameters include population size of 20, maximum 
generation of 30, crossover rate of 0.8, and three 
mutation rates: 0.01, 0.05, and 0.1. 

For each combination of GA parameters, five 
repetitions were made with different random 
seeds. Among all the fixed parameter GAs tested, 
the highest average clustering accuracy of 0.733 
was obtained when using the TWCV-based fitness
function, DTW distance measure, and mutation 
rate of 0.01. This particular combination also 
produced the highest clustering accuracy of 
86.7% among all runs executed for this dataset. 


Figure 1. 30 synthetic control-chart data 

Table 1. Clustering results of control-chart data by fixed parameter GAs

Fitness Function  Distance   Mutation Rate  Run Accuracy                  Avg. Accuracy
TWCV              Euclidean  0.01           .667, .667, .8, .567, .667    .673
TWCV              Euclidean  0.05           .633, .667, .7, .7, .633      .667
TWCV              Euclidean  0.1            .667, .733, .633, .7, .7      .687
TWCV              DTW        0.01           .667, .867, .767, .7, .667    .733
TWCV              DTW        0.05           .667, .7, .6, .7, .667        .667
TWCV              DTW        0.1            .6, .733, .633, .8, .667      .687
DB                Euclidean  0.01           .4, .6, .467, .533, .467      .493
DB                Euclidean  0.05           .433, .6, .433, .467, .433    .473
DB                Euclidean  0.1            .433, .4, .467, .5, .467      .453
DB                DTW        0.01           .567, .633, .5, .7, .667      .601
DB                DTW        0.05           .6, .6, .467, .633, .567      .573
DB                DTW        0.1            .533, .567, .667, .567, .6    .587
Rm                Euclidean  0.01           .467, .533, .533, .567, .7    .560
Rm                Euclidean  0.05           .8, .633, .733, .8, .533      .700
Rm                Euclidean  0.1            .5, .667, .533, .633, .633    .593
Rm                DTW        0.01           .533, .6, .567, .6, .567      .573
Rm                DTW        0.05           .533, .667, .567, .567, .5    .567
Rm                DTW        0.1            .567, .533, .533, .533, .633  .560


Figure 2. Cluster means of a selected fixed parameter GA run 

Figure 2 shows the cluster medoids of one of the
five runs for this particular fixed parameter GA. 
Note that the six patterns can be clearly seen in 
the figure. 

Table 2 summarizes the clustering results of
using AGAs that adapt both crossover and mutation
rates. For the AGAs, the only fixed GA parameter
is the maximum number of generations, set to 30.
The factors varied are fitness function, distance
measure, and population size. For each combination
of GA parameters, five repetitions were made with
different random seeds. Among all the AGAs tested,
the highest average clustering accuracy of 0.7 was
attained when using the TWCV-based fitness
function, Euclidean distance, and population size
of 60. Note that the highest clustering accuracy
produced by AGA is 83.3%, which was produced by
using the $R_m$-based fitness function, DTW
distance, and population size of 20. Figure 3
shows the cluster medoids of one of the five runs
for this particular AGA. Note that the six
patterns can also be clearly seen as in Figure 2.

To determine the effect of fitness function, 
distance measure, and GA parameter, we per- 
formed ANOVA tests on the clustering results 
given in Tables 1 and 2. The results indicate that 
for both fixed parameter GAs and AGAs, the 
fitness function, and the interaction between the 
fitness function and the distance measure are 
highly significant (with p value < 0.005). 

To evaluate the performance of adapting only 
one parameter, we also tried the adaptive GA 
that adapts only the mutation rate. Three
crossover rates were tested: 0.7, 0.8,
and 0.9. The two fixed parameters are the
population size at 20 and the maximum number of
generations at 30. Table 3 summarizes the
clustering accuracies. Among all the AGAs tested, the


Table 2. Clustering results of control-chart data by AGA

Fitness Function  Distance   Population Size  Run Accuracy                  Avg. Accuracy
TWCV              Euclidean  20               .633, .7, .667, .733, .633    .673
TWCV              Euclidean  40               .667, .767, .633, .767, .633  .693
TWCV              Euclidean  60               .667, .733, .7, .7, .7        .700
TWCV              DTW        20               .633, .7, .7, .6, .6          .647
TWCV              DTW        40               .7, .6, .667, .6, .667        .647
TWCV              DTW        60               .667, .633, .6, .667, .733    .660
DB                Euclidean  20               .433, .633, .367, .467, .567  .493
DB                Euclidean  40               .5, .4, .433, .5, .467        .460
DB                Euclidean  60               .433, .4, .5, .4, .467        .440
DB                DTW        20               .6, .667, .667, .533, .5      .593
DB                DTW        40               .533, .667, .633, .633, .6    .601
DB                DTW        60               .5, .633, .633, .633, .6      .600
Rm                Euclidean  20               .533, .633, .767, .667, .533  .627
Rm                Euclidean  40               .533, .633, .767, .667, .533  .627
Rm                Euclidean  60               .533, .533, .533, .533, .633  .553
Rm                DTW        20               .7, .7, .6, .533, .833        .673
Rm                DTW        40               .533, .633, .633, .7, .467    .593
Rm                DTW        60               .567, .567, .5, .567, .5      .540

Figure 3. Cluster means of a selected AGA run

Table 3. Clustering results of control-chart data by AGA that adapts mutation rate

Fitness Function  Distance   Crossover Rate  Run Accuracy                  Avg. Accuracy
TWCV              Euclidean  0.7             .6, .7, .633, .633, .733      .660
TWCV              Euclidean  0.8             .7, .7, .7, .633, .633        .673
TWCV              Euclidean  0.9             .7, .733, .7, .733, .677      .707
TWCV              DTW        0.7             .7, .633, .7, .633, .7        .673
TWCV              DTW        0.8             .8, .567, .633, .8, .677      .693
TWCV              DTW        0.9             .7, .733, .833, .6, .677      .707
DB                Euclidean  0.7             .6, .467, .467, .433, .467    .487
DB                Euclidean  0.8             .467, .467, .367, .467, .467  .447
DB                Euclidean  0.9             .6, .467, .567, .433, .467    .507
DB                DTW        0.7             .633, .567, .533, .733, .667  .627
DB                DTW        0.8             .5, .6, .633, .567, .633      .587
DB                DTW        0.9             .6, .667, .667, .6, .633      .633
Rm                Euclidean  0.7             .633, .667, .533, .767, .6    .640
Rm                Euclidean  0.8             .567, .567, .533, .633, .8    .620
Rm                Euclidean  0.9             .633, .567, .533, .567, .567  .573
Rm                DTW        0.7             .733, .7, .6, .533, .5        .613
Rm                DTW        0.8             .667, .567, .6, .567, .5      .580
Rm                DTW        0.9             .733, .7, .6, .533, .533      .620

highest average clustering accuracy was attained
at 70.7% when using the TWCV-based fitness
function and crossover rate of 0.9, regardless of
the distance measure. The highest clustering
accuracy produced by AGA is 83.3%, which was
produced by using the TWCV-based fitness function,
DTW distance, and crossover rate of 0.9. The ANOVA
test indicates that the fitness function, the
distance measure, and their interaction are
significant in affecting the clustering accuracy
(with p value < 0.005). From Tables 2 and 3, it
can be observed that adapting both crossover and
mutation rates does not have any advantage over
adapting only the mutation rate in terms of
finding the highest clustering accuracy.

Comparing Table 2 (only those results based
on population size of 20) and Table 3 (only those
results based on crossover rate of 0.8) with
Table 1, one can easily see that for the
control-chart data, AGAs do not always perform
better than fixed parameter GAs. Depending upon
the combination of fitness function and distance
measure, AGA could be better than all, none, or
some fixed parameter GAs tested. Therefore, how to
devise an AGA that always performs better than all
fixed-parameter GAs requires further
investigation.

CLUSTERING RESULTS OF 
CYLINDER BELL FUNNEL DATA 

This section presents results obtained from a relatively larger data
set of univariate time series with a known number of clusters. The data
set contains 300 series generated by implementing the cylinder, bell,
and funnel equations given in the UCR Time Series Data Mining Archive
(http://www.cs.ucr.edu/~eamonn/TSDMA/). One hundred series were
generated for each pattern, with each series having a length of 80
data points.
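
A data set of this kind can be approximated with the Python sketch below. It uses the commonly published form of the three equations: a noisy plateau (cylinder), ramp-up (bell), or ramp-down (funnel) of height 6 + eta on a random interval [a, b]. The interval ranges for series of length 80 are an assumption, obtained by scaling the usual length-128 settings, because the chapter does not state them; the function names are illustrative only.

```python
import random

PATTERNS = ("cylinder", "bell", "funnel")

def cbf_series(kind, length=80, rng=random):
    """Return one cylinder, bell, or funnel series of the given length."""
    if kind not in PATTERNS:
        raise ValueError(f"unknown pattern: {kind}")
    # Assumption: event window scaled from the usual length-128 settings
    # (a in [16, 32], b - a in [32, 96]) down to the requested length.
    a = rng.randint(length // 8, length // 4)
    b = a + rng.randint(length // 4, 3 * length // 4)
    eta = rng.gauss(0.0, 1.0)  # amplitude offset, drawn once per series
    series = []
    for t in range(1, length + 1):
        if a <= t <= b:
            if kind == "cylinder":
                shape = 1.0                    # flat plateau
            elif kind == "bell":
                shape = (t - a) / (b - a)      # linear rise
            else:
                shape = (b - t) / (b - a)      # linear fall
        else:
            shape = 0.0
        # Independent unit-variance noise is added at every time point.
        series.append((6.0 + eta) * shape + rng.gauss(0.0, 1.0))
    return series

def make_cbf_dataset(per_class=100, length=80, seed=0):
    """Return a list of (label, series) pairs, per_class series per pattern."""
    rng = random.Random(seed)
    return [(kind, cbf_series(kind, length, rng))
            for kind in PATTERNS
            for _ in range(per_class)]

dataset = make_cbf_dataset()
print(len(dataset), len(dataset[0][1]))
```

Because the labels are known, the clustering accuracy of any method applied to such a data set can be computed directly.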


Table 4. Clustering results of cylinder-bell-funnel data by AGA that adapts mutation rate 


Fitness function  Distance    Crossover rate   Run accuracy                     Avg. accuracy
TWCV              Euclidean   0.7              .823, .843, .823, .873, .853     .843
TWCV              Euclidean   0.8              .877, .820, .847, .883, .807     .847
TWCV              Euclidean   0.9              .867, .857, .840, .847, .857     .854
TWCV              DTW         0.7              .607, .537, .647, .460, .710     .592
TWCV              DTW         0.8              .643, .570, .543, .647, .760     .633
TWCV              DTW         0.9              .757, .763, .640, .517, .660     .667
DB                Euclidean   0.7              .800, .817, .827, .817, .847     .822
DB                Euclidean   0.8              .830, .820, .713, .817, .693     .775
DB                Euclidean   0.9              .863, .880, .823, .833, .880     .856
DB                DTW         0.7              .517, .607, .527, .563, .540     .551
DB                DTW         0.8              .533, .617, .553, .557, .567     .565
DB                DTW         0.9              .593, .503, .487, .547, .543     .535
Rm                Euclidean   0.7              .623, .610, .653, .557, .527     .594
Rm                Euclidean   0.8              .777, .610, .710, .697, .773     .713
Rm                Euclidean   0.9              .703, .617, .670, .760, .627     .675
Rm                DTW         0.7              .520, .550, .520, .603, .650     .569
Rm                DTW         0.8              .777, .570, .543, .613, .493     .599
Rm                DTW         0.9              .443, .607, .483, .560, .493     .515




Exploratory Time Series Data Mining by Genetic Clustering 


We ran the adaptive GA (the one that adapts only the mutation rate)
under 18 configurations by varying the fitness function, the distance
measure, and the crossover rate. In each configuration, the GA was
repeated five times. Three crossover rates were tested: 0.7, 0.8, and
0.9. The two fixed parameters are the population size, at 60, and the
maximum number of generations, at 30. Table 4 summarizes the clustering
results. The ANOVA test on these results reveals that the fitness
function, the distance measure, and their interaction significantly
affect the clustering accuracy (p value < 0.005). The Euclidean
distance clearly outperforms dynamic time warping for this data set.
As far as the fitness function is concerned, the TWCV-based fitness
function is the best and the Rm-based index is the worst. The best
average clustering accuracy is 85.6%, obtained by the combination of
the DB-based fitness, the Euclidean distance, and a crossover rate of
0.9. The highest clustering accuracy among all runs is 88.3%, which
was produced in one of the five replicates by the combination of the
TWCV-based fitness, the Euclidean distance, and a crossover rate of
0.8.
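
The summary figures quoted above can be re-derived from the rows of Table 4. The following Python sketch is only an illustrative check and not part of the original study; the dictionary layout and variable names are chosen here for convenience.

```python
from statistics import mean

# Run accuracies copied from Table 4, keyed by
# (fitness function, distance measure, crossover rate).
table4 = {
    ("TWCV", "Euclidean", 0.7): [.823, .843, .823, .873, .853],
    ("TWCV", "Euclidean", 0.8): [.877, .820, .847, .883, .807],
    ("TWCV", "Euclidean", 0.9): [.867, .857, .840, .847, .857],
    ("TWCV", "DTW", 0.7): [.607, .537, .647, .460, .710],
    ("TWCV", "DTW", 0.8): [.643, .570, .543, .647, .760],
    ("TWCV", "DTW", 0.9): [.757, .763, .640, .517, .660],
    ("DB", "Euclidean", 0.7): [.800, .817, .827, .817, .847],
    ("DB", "Euclidean", 0.8): [.830, .820, .713, .817, .693],
    ("DB", "Euclidean", 0.9): [.863, .880, .823, .833, .880],
    ("DB", "DTW", 0.7): [.517, .607, .527, .563, .540],
    ("DB", "DTW", 0.8): [.533, .617, .553, .557, .567],
    ("DB", "DTW", 0.9): [.593, .503, .487, .547, .543],
    ("Rm", "Euclidean", 0.7): [.623, .610, .653, .557, .527],
    ("Rm", "Euclidean", 0.8): [.777, .610, .710, .697, .773],
    ("Rm", "Euclidean", 0.9): [.703, .617, .670, .760, .627],
    ("Rm", "DTW", 0.7): [.520, .550, .520, .603, .650],
    ("Rm", "DTW", 0.8): [.777, .570, .543, .613, .493],
    ("Rm", "DTW", 0.9): [.443, .607, .483, .560, .493],
}

# Average accuracy of the five replicates for every configuration.
averages = {config: mean(runs) for config, runs in table4.items()}

# Configuration with the best average, and the one holding the best single run.
best_avg_config = max(averages, key=averages.get)
best_run_config = max(table4, key=lambda config: max(table4[config]))

# Mean accuracy over all runs that used a given distance measure.
by_distance = {
    d: mean(a for (_, dist, _), runs in table4.items() if dist == d for a in runs)
    for d in ("Euclidean", "DTW")
}

print(best_avg_config, round(averages[best_avg_config], 3))
print(best_run_config, max(table4[best_run_config]))
print({d: round(v, 3) for d, v in by_distance.items()})
```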


CLUSTERING RESULTS OF
BATTLE SIMULATION DATA

The OneSAF combat simulation software was used to create a battle
scenario for our experiments (Heilman et al., 2002). Time-series data
were collected using a modified version of the Killer-Victim
Scoreboard (KVS) method originally developed by O’May and Heilman
(2002) to collect static-feature data. The KVS method modifies OneSAF
to provide critical battlespace data. A set of three files was
generated during each simulation execution. These files include entity
identification data, firer-target interaction data, and logistics and
appearance data. The raw experimental data collected in each run were
then processed into five time series (by arranging them in the order
of their timestamps) reflecting the state of the ongoing battle from
the viewpoint of the “blue” force:

• Relative territory ownership (denoted by g 
in figures) 

• Relative firepower strength (s) 

• Relative ammunition support (a) 

• Relative fuel support (f) 

• Relative firing intensity (i) 


Figure 4. Five time series of a sample battle simulation run.

Time-series data of 15 battle simulation runs were prepared for this
study according to the outlined procedure. Note that these data were
nonuniformly sampled as a result of the event-triggered
data-collection mechanism. The result of a sample run is shown in
Figure 4.

We intentionally use only a few runs in this study because one often
can afford only a limited number of simulation runs in actual
applications, due to time constraints. To enable real-time response,
which is desirable in order to provide fast decision support, our
analyses also attempt to use as few attributes and data points as
possible. First, we applied linear interpolation to convert the
original nonuniform series into uniform ones. Figure 5 shows the
interpolated results of a battle simulation run. The uniform interval
is consistently set at 100 seconds, which is relatively large.
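
The conversion step can be sketched in Python as follows. This is a minimal illustration of linear interpolation onto a uniform grid, assuming strictly increasing timestamps measured in seconds; the function name and the sample values are hypothetical and are not taken from the simulation data.

```python
def resample_linear(times, values, step=100.0):
    """Linearly interpolate a nonuniform series onto a grid with spacing `step`.

    Assumes `times` is strictly increasing and has at least two points.
    Returns the uniform time grid and the interpolated values.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    if step <= 0:
        raise ValueError("step must be positive")
    grid, resampled = [], []
    j = 0  # index of the left endpoint of the current segment
    k = 0  # index of the current grid point
    while times[0] + k * step <= times[-1]:
        t = times[0] + k * step
        # Advance to the segment [times[j], times[j + 1]] that contains t.
        while times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        v0, v1 = values[j], values[j + 1]
        weight = (t - t0) / (t1 - t0)
        grid.append(t)
        resampled.append(v0 + weight * (v1 - v0))
        k += 1
    return grid, resampled

# Hypothetical event-triggered samples (time in seconds, indicator value).
event_times = [0.0, 30.0, 250.0, 400.0]
event_values = [0.0, 0.3, 0.5, 0.8]
uniform_times, uniform_values = resample_linear(event_times, event_values)
print(uniform_times)
print([round(v, 3) for v in uniform_values])
```

A coarse step such as 100 seconds keeps the series short, at the cost of smoothing out activity that happens between grid points.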

Comparing Figure 5 with Figure 4, it is observed that the overall
trends are retained for all series, except that some high-frequency
activities of the intensity series are lost. The empirical results
indicate that this relatively large downsampled size is sufficient for
the clustering task. Nevertheless, it might be desirable to determine
the optimal downsampled size in the future by investigating the
tradeoff between improved clustering (an unknown) and increased
computational cost. Naturally, a better interpolation method might
also exist for our data. To limit the scope of this study, we elect to
address these issues in the future.

Correlation analyses were performed on the 
five interpolated time series shown in Figure 5. 
It was discovered that series g and s are highly 
correlated (0.858); the same is true for f and a 
(0.771). These correlations generally apply to all 
simulation runs. Therefore, in the following, we 
will only discuss three indicators: g, a, and i. 
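
The screening step described above can be sketched in Python as follows. The Pearson coefficient is assumed here as the correlation measure, the 0.75 threshold is an assumption chosen so that both reported pairs would count as highly correlated, and the three short series are hypothetical values used only to exercise the code.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def highly_correlated_pairs(series, threshold=0.75):
    """Return the pairs of series names whose |correlation| reaches the threshold."""
    return [(p, q) for p, q in combinations(series, 2)
            if abs(pearson(series[p], series[q])) >= threshold]

# Hypothetical interpolated indicator series (not data from the chapter).
indicators = {
    "g": [0.10, 0.20, 0.40, 0.50],
    "s": [0.15, 0.22, 0.41, 0.55],
    "i": [0.90, 0.10, 0.80, 0.20],
}
print(highly_correlated_pairs(indicators))
```

For every pair that is flagged, one member can stand in for both, which is how five indicators reduce to three in the text.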

Figures 6, 7, and 8 show the 15 interpolated time series for each of
the three indicators g, a, and i, respectively. In these figures, the
series numbered 0, 1, ..., 14 come from simulation runs x0521, x4672,
x0996, x5703, x7553, x3017, x0620, x8646, x4757, x0250, x2739, x6687,
x5414, x7554, and x2514, respectively. Based on the clustering results
for the control-chart data and the cylinder-bell-funnel data, we chose
the TWCV-based fitness function for both the fixed-parameter GAs and
the AGAs. The DTW distance measure was chosen to handle these time
series of unequal length. The maximum number of generations was kept
at 30 throughout. For better results, a population size of 20 and a
mutation rate of 0.01 were selected for the fixed-parameter GAs,
whereas a population size of 60 was chosen for the AGAs. For each GA,
five repetitions were made.

Figure 5. Interpolated results of time series shown in Figure 4

Figure 6. Fifteen series of relative territory ownership of the blue forces

Figure 7. Fifteen series of relative ammunition values of the blue forces

Figure 8. Fifteen series of relative fire intensity values of the blue forces

Tables 5-7 summarize the clustering results of the territory ownership
series, ammunition series, and firing intensity series, respectively,
generated by both fixed-parameter GAs and AGAs. In all tables, the
medoids used by the GA to form clusters are underlined. The best
fitness value is shown in boldface for each indicator series. Based on
these tables, the following observations can be made:

1. The AGAs, in general, generate more consistent results (the only
exception being the four-cluster results for the territory ownership
series) and are able to find solutions with equivalent or higher best
fitness than the fixed-parameter GAs. Therefore, the results
generated by AGAs are more trustworthy.

2. As the number of clusters increases, the consistency decreases for
both AGAs and fixed-parameter GAs.

Table 5. Clustering results of territory ownership series 


Method  Number of clusters  Repl.  Best fitness  Clusters
FPGA    2                   1-4    3308602       {0, 1, 4, 5, 8, 11, 12, 14} {2, 3, 6, 7, 9, 10, 13}
                            5      3224082       {0, 1, 4, 5, 8, 11, 12, 14} {2, 3, 6, 7, 9, 10, 13}
        3                   1-2    3907545       {1, 5, 12} {0, 4, 6, 8, 9, 11, 14} {2, 3, 7, 10, 13}
                            3      3948684       {1, 5, 12} {0, 4, 8, 11, 14} {2, 3, 6, 7, 9, 10, 13}
                            4      3842933       {1, 5, 12} {0, 4, 8, 11, 14} {2, 3, 6, 7, 9, 10, 13}
                            5      3884901       {0, 1, 5, 8, 12} {2, 4, 6, 9, 11, 14} {3, 7, 10, 13}
        4                   1-3    4365496       {1, 5, 12} {0, 8, 11, 14} {2, 4, 6, 9} {3, 7, 10, 13}
                            4      4424450       {1, 5, 12} {0, 8, 11, 14} {2, 4, 6, 9} {3, 7, 10, 13}
                            5      4462072       {1, 5, 12} {0, 8, 11, 14} {2, 4, 6, 9} {3, 7, 10, 13}
AGA     2                   1-4    3308602       {0, 1, 4, 5, 8, 11, 12, 14} {2, 3, 6, 7, 9, 10, 13}
                            5      3229174       {0, 1, 4, 5, 8, 11, 12, 14} {2, 3, 6, 7, 9, 10, 13}
        3                   1-3    3978621       {1, 5, 12} {0, 4, 8, 11, 14} {2, 3, 6, 7, 9, 10, 13}
                            4-5    3948684       {1, 5, 12} {0, 4, 8, 11, 14} {2, 3, 6, 7, 9, 10, 13}
        4                   1      4386211       {1, 5, 12} {0, 8, 11, 14} {2, 4, 6, 9} {3, 7, 10, 13}
                            2      4259005       {0} {1, 5, 8, 12} {2, 4, 6, 9, 11, 14} {3, 7, 10, 13}
                            3      4294529       {1, 5, 8, 12} {0, 11, 14} {2, 4, 6, 9} {3, 7, 10, 13}
                            4      4248009       {0} {1, 5, 12} {4, 8, 11, 14} {2, 3, 6, 7, 9, 10, 13}
                            5      4462072       {1, 5, 12} {0, 8, 11, 14} {2, 4, 6, 9} {3, 7, 10, 13}

3. A different set of medoids could lead to the same clustering
results. For instance, the fixed-parameter GA forms the same set of
four clusters of territory ownership series based on three different
sets of medoids (for replications 1-3, 4, and 5, respectively).


Table 6. Clustering results of ammunition series 


Method  Number of clusters  Repl.  Best fitness  Clusters
FPGA    2                   1-5    4505133       {8} {0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14}
        3                   1-3    4302312       {8} {2} {0, 1, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14}
                            4      4297057       {8} {2, 3, 4} {0, 1, 5, 6, 7, 9, 10, 11, 12, 13, 14}
                            5      4178824       {5} {11} {0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 14}
        4                   1      4086145       {8} {2} {14} {0, 1, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13}
                            2      4081404       {8} {2, 3, 4} {14} {0, 1, 5, 6, 7, 9, 10, 11, 12, 13}
                            3      4070924       {8} {11} {5} {0, 1, 2, 3, 4, 6, 7, 9, 10, 12, 13, 14}
                            4      4029205       {2} {3, 4} {11} {0, 1, 5, 6, 7, 8, 9, 10, 12, 13, 14}
                            5      4060230       {8} {2, 3, 4} {7} {0, 1, 5, 6, 9, 10, 11, 12, 13, 14}
AGA     2                   1-5    4505133       {8} {0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14}
        3                   1-4    4302312       {8} {2} {0, 1, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14}
                            5      4297057       {8} {2, 3, 4} {0, 1, 5, 6, 7, 9, 10, 11, 12, 13, 14}
        4                   1-3    4107672       {8} {2} {3, 4} {0, 1, 5, 6, 7, 9, 10, 11, 12, 13, 14}
                            4      4101704       {8} {2, 3, 4} {11} {0, 1, 5, 6, 7, 9, 10, 12, 13, 14}
                            5      4086145       {8} {2} {14} {0, 1, 2, 3, 5, 6, 7, 9, 10, 11, 12, 13}


Table 7. Clustering results of fire intensity series 


Method  Number of clusters  Repl.  Best fitness  Clusters
FPGA    2                   1      3794554       {0, 8, 11} {1, 2, 3, 4, 5, 7, 9, 10, 12, 13, 14}
                            2-5    3820013       {0, 11, 14} {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13}
        3                   1      3919665       {6} {0, 11, 14} {1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13}
                            2      3892864       {6} {0, 8, 11} {1, 2, 3, 4, 5, 7, 9, 10, 12, 13, 14}
                            3      3811803       {6} {4, 14} {0, 1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 13}
                            4      3870329       {8} {0, 11, 14} {1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13}
                            5      3864623       {6} {11, 14} {0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13}
        4                   1      3956146       {6} {11, 14} {0} {1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13}
                            2-3    3969378       {0, 8} {6} {11, 14} {1, 2, 3, 4, 5, 7, 9, 10, 12, 13}
                            4      3900812       {6} {4, 14} {0} {1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 13}
                            5      3938876       {2} {0, 11, 14} {6} {1, 3, 4, 5, 7, 8, 9, 10, 12, 13}
AGA     2                   1-5    3820013       {0, 11, 14} {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13}
        3                   1-5    3919665       {6} {0, 11, 14} {1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13}
        4                   1-4    3972659       {8} {6} {0, 11, 14} {1, 2, 3, 4, 5, 7, 9, 10, 12, 13}
                            5      3956145       {0} {6} {11, 14} {1, 2, 3, 4, 5, 7, 8, 9, 10, 12, 13}

Since there is no ground truth for this data set, the clustering
accuracy cannot be computed, and it is not easy to conclude which
clustering result is better. As is standard practice in this
situation, we resort to visual inspection for qualitative evaluation.
Figures 9-11 show the four-cluster results generated by the AGA for
each indicator series, respectively.

Figure 9. Four-cluster of territory ownership series generated by AGA

Figure 10. Four-cluster of ammunition series generated by AGA

Figure 11. Four-cluster of fire intensity series generated by AGA

Note that in Figures 9-11, the values along the vertical axis were
changed from the original (for the series in three clusters) in order
to separate one cluster from the others. Also note that the
time-warping distance warps the time axis in comparing two series of
unequal length. The No. 2 and No. 6 series of territory ownership are
taken as an example in Figure 12 to show how the time axis is warped
to compare the two. Since the warping is done on a pair-by-pair basis,
it would take a lot of space to show all the warped pairs. To better
visualize the similarity between any pair of series, one must also
exercise this warping ability, as demonstrated in Figure 12.
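
The warping idea can be illustrated with the textbook dynamic-programming form of the DTW distance, sketched below in Python. The absolute difference as the local cost and the three allowed steps (match, insertion, deletion) are assumptions; the chapter does not spell out which DTW variant was implemented.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of any lengths.

    Assumptions: local cost |a[i] - b[j]|, unconstrained warping window,
    and the standard step pattern (diagonal, vertical, horizontal).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],  # advance both series
                                     cost[i - 1][j],      # advance a only
                                     cost[i][j - 1])      # advance b only
    return cost[n][m]

# Two series of unequal length that differ only by a repeated value.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

Because one point of a series may be matched to several points of the other, the two series need not have the same length, which is why DTW suits the battle simulation series.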

Based on these results, we argue that the proposed genetic clustering
method is effective for grouping time-series data produced in
battle-simulation experiments. The notable feature of the proposed
method is that it takes the entire battle sequence into consideration.
Similar battle sequences, grouped in the same cluster, indicate that
they are affected by the same set of mechanisms in play during the
battle. Therefore, a model should be built for each cluster to explain
the mechanisms. Without using clustering as an exploratory step, one
might build just one global model for all runs by assuming that there
is only one mechanism in play. The predictions derived from this
single global model are expected to be less accurate than those from a
set of local models developed for each cluster of data. The existence
of more than one cluster for the same scenario indicates the
nondeterministic nature of battles, with several different mechanisms
in play at the same time. This is exactly the knowledge discovered by
exploratory data mining. The intriguing question is what tips a battle
from one set of mechanisms to another. Further investigation is
necessary to answer this question. The study by Bodt et al. (Bodt,
Forester, Hansen, Heilman, Kaste, & O’May, 2002) is a step in this
direction, except that they did not consider the entire battle
sequence.

DISCUSSION 

To explain why one combination of fitness function and distance
measure performs better than another, the correlation between the best
fitness value and the clustering accuracy was computed for the
synthetic control-chart data, as given in Table 8. It is interesting
to see that for both fixed-parameter GAs and AGAs, the best clustering
results were generated by the combination having the highest positive
correlation between the best fitness value and the clustering
accuracy. The negative correlations are undesirable because such
fitness functions are counterproductive in the sense that maximizing
the fitness by the GA actually leads to lower clustering accuracy. The
low positive correlation indicates that the best fitness function used
in this study is marginal, and a better one should be developed in
order to achieve even higher clustering accuracy. This will be a topic
for future studies.

Figure 12. An example of dynamic time warping

For comparison, the clustering results of k-means were also obtained
for both the synthetic control-chart data and the cylinder-bell-funnel
data. For each data set, five runs were made. The clustering
accuracies obtained are 63.3%, 70%, 60%, 56.7%, and 60% for the
synthetic control-chart data, and 87%, 87.3%, 87%, 87.3%, and 87% for
the cylinder-bell-funnel data. Therefore, genetic clustering methods
could obtain higher accuracy than k-means if an appropriate
combination of fitness function and distance measure is used.
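
The k-means figures can be summarized with a few lines of Python, shown here only as an arithmetic aid; the variable names are illustrative.

```python
from statistics import mean

# k-means clustering accuracies (%) of the five runs reported in the text.
kmeans_runs = {
    "control chart": [63.3, 70.0, 60.0, 56.7, 60.0],
    "cylinder-bell-funnel": [87.0, 87.3, 87.0, 87.3, 87.0],
}

# Average and best accuracy per data set.
kmeans_summary = {
    name: (round(mean(runs), 2), max(runs))
    for name, runs in kmeans_runs.items()
}
for name, (average, best) in kmeans_summary.items():
    print(f"{name}: average {average}%, best run {best}%")
```

Set against the AGA results reported earlier, the k-means average on the control-chart data (62.0%) is below the best AGA average of 70.7%. On the cylinder-bell-funnel data, the k-means average (87.12%) is above the best AGA average of 85.6% in Table 4, while the best single AGA run (88.3%) is above the best k-means run (87.3%).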


Table 8. Correlation between best fitness and accuracy

Fitness function  Distance measure  Fixed parameter GAs  AGAs
TWCV              Euclidean          0.327                0.264
TWCV              DTW               -0.100                0.267
DB                Euclidean         -0.783               -0.605
DB                DTW               -0.458                0.182
Rm                Euclidean         -0.129               -0.556
Rm                DTW               -0.676               -0.241


CONCLUSION

This chapter presented an exploratory data-mining study of time-series
data. To this end, genetic algorithms were developed for clustering
the data. First, the effects of three fitness functions, two distance
measures, and some selected GA parameters were investigated using the
synthetic control-chart data that is available in the public domain.
The results indicate that the two most significant factors are the
fitness function and the interaction between the fitness function and
the distance measure. It was also discovered that the best fitness
value has a low positive correlation with the clustering accuracy. To
increase the clustering accuracy, a better fitness function must be
developed, which poses both an opportunity and a challenge for
researchers interested in this area. In addition, adaptive GAs did not
always outperform fixed-parameter GAs. Better adaptive schemes are
thus needed. Further testing on the cylinder-bell-funnel data confirms
once again the significance of the fitness function, the distance
measure, and their interaction. In addition, it was found that dynamic
time warping did not work as well as the Euclidean distance for this
data set. GA parameters such as population size, crossover rate, and
mutation rate are relatively insignificant compared with the fitness
function and the distance measure in the context of clustering.

The proposed genetic clustering methods were also applied to
time-series data acquired in battle simulation experiments. The
potential use of the proposed methods in grouping similar battle
profiles and separating dissimilar battle profiles is shown and
discussed. As far as future study is concerned, this study can be
further improved by investigating the following general topics:

1. Develop a better fitness function. 

2. Develop a better GA parameter adaptation 
method. 

3. Try other chromosome coding schemes, 
such as integer coding rather than binary 
coding. 

4. Develop a viable validation index, either 
application dependent or application inde- 
pendent, to determine how many clusters 
are optimal. 

5. Extend the current method, which is applicable only to univariate
time-series data, to multivariate time-series data.

Specific topics for the battle simulation study 
in the future include: 


1. Complement this study with an in-depth 
analysis of the driving mechanisms/events 
that shape each group of battle sequences 
(or time-series profiles). 

2. Investigate the effect of the interpolation 
method and sampling interval in converting 
nonuniform data into uniform data. 

Finally, we should point out that each time series was clustered
individually in this study. For the battle simulation data, some
multivariate time-series clustering methods, such as hidden Markov
models, should be applicable. Alternatively, one can make use of a
two-stage clustering procedure recently developed by Liao (2007),
which involves converting multivariate time series into a univariate
discrete time series in the first step and then applying the proposed
genetic clustering algorithm in the second step. The results of the
first stage have been presented in Liao et al. (Liao, Bodt, Forester,
Hansen, Heilman, Kaste, & O’May, 2002). The complete two-stage
procedure and results will be presented in our future work.


ABSTRACT

This chapter presents a hybrid approach that integrates a genetic algorithm (GA) and data mining to 
produce control signatures. The control signatures define the best parameter intervals leading to a de- 
sired outcome. This hybrid method integrates multiple rule sets generated by a data-mining algorithm 
with the fitness function of a GA. The solutions of the GA represent intersections among rules providing 
tight parameter bounds. The integration of intuitive rules provides an explanation for each generated 
control setting and it provides insights into the decision-making process. The ability to analyze parameter 
trends and the feasible solutions generated by the GA with respect to the outcomes is another benefit 
of the proposed hybrid method. The presented approach for deriving control signatures is applicable 
to various domains, such as energy, medical protocols, manufacturing, airline operations, customer 
service, and so on. Control signatures were developed and tested for control of a power plant boiler. 
These signatures discovered insightful relationships among parameters. The results and benefits of the 
proposed method for the power plant boiler are discussed in the chapter. 

INTRODUCTION

Optimizing process controls is imperative in the energy, medical, and
service applications. Due to the increase in the volume of data,
decision making in real time is becoming more difficult, and may
potentially lead to inefficiencies and hazardous situations.
Intelligent control systems have proven to be effective in optimizing
complex processes. Merging computational concepts, such as neural
networks, genetic algorithms, data mining, and fuzzy logic, can lead
to robust controls (Krishnakumar & Goldberg, 1992; Lee, Perakis,
Sevcik, Santoso, Lausterer, & Samad, 2000). Though current intelligent
approaches may improve operations, they provide limited insights into
the decision-making process.

This chapter describes a hybrid method that integrates data mining
(Cios, Pedrycz, & Swiniarski, 1998; Fayyad, Piatetsky-Shapiro, Smyth,
& Uthurusamy, 1995) and genetic algorithm (GA) (Goldberg, 1989;
Holland, 1975; Lawrence, 1987; Michalewicz, 1992) concepts to define
robust and explicit parameter set points. The hybrid approach consists
of partitioning data, developing classifiers for each data set, and
combining and analyzing the classifiers (Mitra, Pal, & Mitra, 2002).
These steps lead to the development of control signatures (Kusiak,
2002) that define ranges of parameter settings producing a desired
outcome (decision). Control signatures are helpful in learning the
interactions and relationships between the parameters.

Data mining is the process of discovering interesting and previously
unknown patterns in data sets (Cios, Pedrycz, & Swiniarski, 1998;
Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1995). A typical
data-mining algorithm, applied to partitioned data sets, generates
multiple rule sets that describe parameter relationships. The GA
provides a global search mechanism to discover the intersections among
the decision rules. The intersections among decision rules could be
analyzed through visualization (Kusiak, 2001). As the number of rules
increases, the graphical presentation and analysis become tedious.
Furthermore, this kind of analysis does not provide the additional
information leading to tighter parameter bounds, control signatures,
and increased prediction accuracy.

In this research, the process data is transformed into multiple
knowledge bases that are used as the foundation of a GA fitness
function. The GA mechanism strongly promotes the solutions that are in
the feasible region formed by the rule intersections. It not only
allows for exclusion of the less reliable (feasible) regions, but also
defines the complex commonality among multiple rules. This in turn
provides tighter bounds on various parameters. Complex nonlinear
applications can be analyzed as the parameter relationships are
preserved by the rule sets that are incorporated into the GA fitness
function.

The analysis of the control signatures across different outcomes
provides information regarding the general parameter trends. Analyzing
these trends and the feasible solutions generated by the GA is a way
of visualizing the complex relationships and identifying key
parameters. The proposed hybrid approach provides an increased level
of insight and was applied and tested on data from a power plant
boiler. The signatures defined ideal parameter ranges for boiler
operations.

METHOD 

This section describes a hybrid approach that 
integrates genetic algorithm (GA) and data min- 
ing to define control signatures (Figure 1). The 
use of GA is novel, due to the incorporation of 
data-mining output in the fitness function. Data 
mining defines relationships among parameters, 
while a standard GA provides the global search 
mechanism to identify ranges for parameter set- 
tings. The basic concepts of data mining and GA 
that are relevant to the development of control 
signatures are discussed next. 

The hybrid approach requires a data set with known outcomes, but it
can handle continuous, discrete, and categorical data. The GA is
facilitated by data preprocessing through the normalization of
continuous parameters. Discrete and categorical parameters are
assigned a value based on their probability of occurrence. All
parameter values are transformed back to their respective ranges at
the termination of the GA. The data set with various parameters and
known outcomes is used as a starting point in the analysis outlined in
Figure 1.

In order to increase the number of rule sets, the initial data set is
partitioned into subsets of parameters and data records to form
multiple data subsets (Figure 1) (Kusiak, Sham, & Dixon, 2003). This
allows the investigation of various regions of the solution space. The
partial data sets can be formed by utilizing a domain expert’s
knowledge or relevant knowledge from the literature, or at random. A
data-mining algorithm is applied independently to each data subset,
thus producing several rule sets (classifiers). High cross-validation
accuracy of the extracted rules increases the confidence in the
derived control signatures. The significant rules extracted from these
data sets form the GA fitness function. The diversity and redundancy
of the rule sets are reflected in the fitness function, as explained
in detail in Section 2.3.
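
The random variant of the partitioning step might be sketched in Python as follows. This is a hypothetical illustration only: the function name, the subset sizes, the record layout (one dictionary per record with an "outcome" key), and the sample values are all assumptions, and the expert-driven and literature-driven partitions mentioned above are not covered.

```python
import random

def partition(records, parameters, n_subsets=3, n_params=2,
              record_fraction=0.7, seed=0):
    """Form data subsets by sampling parameters (columns) and records (rows).

    Every subset keeps the known outcome so that a data-mining algorithm
    can later be applied to it independently.
    """
    rng = random.Random(seed)
    n_records = max(1, int(len(records) * record_fraction))
    subsets = []
    for _ in range(n_subsets):
        chosen_params = rng.sample(parameters, n_params)
        chosen_rows = rng.sample(records, n_records)
        subset = []
        for row in chosen_rows:
            reduced = {p: row[p] for p in chosen_params}
            reduced["outcome"] = row["outcome"]
            subset.append(reduced)
        subsets.append(subset)
    return subsets

# Hypothetical records with three parameters and a known outcome.
demo_records = [
    {"A": 0.1 * k, "B": k % 4, "C": 5 - k % 3, "outcome": k % 2}
    for k in range(10)
]
demo_subsets = partition(demo_records, ["A", "B", "C"])
print(len(demo_subsets), len(demo_subsets[0]), sorted(demo_subsets[0][0]))
```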

The analysis of the GA feasible population 
can be conducted by various measures such as 
min, max, average, frequency distributions, best 
solution, and so forth. 


Figure 1. Computation of GA fitness function. 



Data-Mining Concepts 

Discovering hidden patterns in the data may 
represent valuable knowledge expressed in the 
form of control signatures. There are various 
data-mining algorithms available in the literature. 
Learning (classification) systems of potential 
interest to this research fall into 10 categories 
(Kusiak, 2001a): 

a. Classical statistical methods (e.g., linear 
discriminant, quadratic discriminant, and 
logistic discriminant analyses). 

b. Modern statistical techniques (e.g., projection pursuit
classification, density estimation, k-nearest neighbor, causal
networks, Bayes theorem).

c. Neural networks (e.g., backpropagation, learning vector
quantizers, and radial basis function networks).

d. Support vector machines. 

e. Decision tree methods (e.g., C4.5, ID3, 
CN2). 

f. Decision rule algorithms (e.g., AQ15, LERS 
and numerous other algorithms based on the 
rough set theory). 

g. Association rule algorithms (e.g., DB2 Intelligent Miner).

h. Learning classifier systems (e.g., GOFFER, 
MonaLysa, and XCS). 

i. Inductive learning algorithms. 

j. Text learning algorithms. 

In this research, the learning algorithms of category (e) are explored
for two major reasons:

• Generation of explicit knowledge in a form acceptable to a user. The
user is able to understand the extracted knowledge, assess its
usefulness, and learn new and interesting concepts.

• Controlled prediction accuracy. This characteristic is due to the
nature of the decision-making approach applied in this research.

The C4.5 decision tree algorithm (Quinlan, 
1986) used in this research produces rules in the 
following format: 

IF Boiler Master <= -0.53 AND Air Master > -1.5 
AND Air Fuel Ratio > 0.13 AND Avg Mid Temp 
> -0.42 AND Biomass Feed Rate > 0.4 THEN 
Interval = 88_90 [Rule strength = 0.857] 

Each rule includes a premise and a conclusion, and is assigned a
strength. The rule strength describes the percentage of observations
in a given class (outcome) that match both the rule condition and the
action.
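
The rule format and the strength measure can be made concrete with the short Python sketch below. The data structure, the function names, and the three records are hypothetical; the strength is computed under one reading of the definition above, namely the share of records in the rule's outcome class whose values satisfy the premise.

```python
import operator

# Comparison operators that may appear in a rule premise.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq}

# The example rule from the text, encoded as (parameter, operator, value).
rule = {
    "conditions": [
        ("Boiler Master", "<=", -0.53),
        ("Air Master", ">", -1.5),
        ("Air Fuel Ratio", ">", 0.13),
        ("Avg Mid Temp", ">", -0.42),
        ("Biomass Feed Rate", ">", 0.4),
    ],
    "outcome": "88_90",
}

def matches(rule, record):
    """True if the record satisfies every condition of the rule premise."""
    return all(OPS[op](record[name], value)
               for name, op, value in rule["conditions"])

def rule_strength(rule, records, outcome_key="Interval"):
    """Share of records in the rule's class that also match its premise."""
    in_class = [r for r in records if r[outcome_key] == rule["outcome"]]
    if not in_class:
        return 0.0
    return sum(matches(rule, r) for r in in_class) / len(in_class)

# Hypothetical normalized records, used only to exercise the functions.
base = {"Boiler Master": -0.6, "Air Master": -1.0, "Air Fuel Ratio": 0.2,
        "Avg Mid Temp": 0.0, "Biomass Feed Rate": 0.5}
sample_records = [
    {**base, "Interval": "88_90"},                            # matches
    {**base, "Biomass Feed Rate": 0.1, "Interval": "88_90"},  # fails one test
    {**base, "Interval": "86_88"},                            # other class
]
print(rule_strength(rule, sample_records))
```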

Multiple applications of the data-mining 
algorithm to various data subsets produce rules 
representing different solutions. Each of the solu- 
tions potentially describes a setting that results 
in the desired outcome. 


Genetic Algorithm 

A GA is a search procedure based on the concepts of natural genetics
(Goldberg, 1989; Holland, 1975; Lawrence, 1987; Michalewicz, 1992;
Obitko, 2004). It is initiated with a set of solutions (represented by
chromosomes) called the population. Each solution in the population is
evaluated in terms of its fitness. Solutions chosen to form new
chromosomes (offspring) are selected according to their fitness; that
is, the more suitable they are, the more likely they are to reproduce.
This is repeated until some stopping criterion (for example, the
number of populations or the improvement of the best solution) is
satisfied. The GA operators and the solution-encoding scheme need to
be chosen carefully to appropriately explore and exploit the solution
space (Pham & Karaboga, 2000).

The steps of the GA are outlined in Figure 2
(Goldberg, 1989). First, the solution-encoding
scheme is designed and the problem-specific
fitness function is formulated. An initial
(random) population of n chromosomes (solutions)
is generated, and these solutions are evaluated
based on the value of their scaled fitness.
Depending upon the fitness function value, some
solutions from the old population are chosen for
mating using a selection scheme, for example,
roulette wheel or tournament selection. The
crossover operator is applied to the selected
parents. Generally, a single-point crossover is
recommended if the chromosome size is small
(Goldberg, 1989; Pham & Karaboga, 2000).
Crossover is performed until m new offspring are
generated, where m is equal to the size n of the
old population.

Figure 2. Genetic algorithm

There is a possibility that the initial population
was created in such a way that the GA cannot
generate some solutions of interest. To avoid this
and make the GA more robust, a mutation operator
is used. The mutation operator alters the value of
some genes according to the mutation probability
(Goldberg, 1989; Pham & Karaboga, 2000). This
allows the exploration of previously unattainable
regions of the search space. After crossover and
mutation are carried out, a new population is
generated. The fitness of the new solutions is
evaluated to measure their goodness. If the GA
termination condition is not satisfied, then the
new population becomes the old population and
all the steps are repeated.
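
The loop described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this research: the real-valued genes, the Gaussian mutation step (`sigma`), the fixed number of generations as the stopping criterion, the tracking of the best solution found so far, and all function names are assumptions introduced for the example.

```python
import random

def tournament(population, fitness, k, rng):
    """Return the fittest of k chromosomes drawn at random."""
    return max(rng.sample(population, k), key=fitness)

def single_point_crossover(a, b, rng):
    """Swap the tails of two parents at one random cut point (length >= 2)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom, p_mut, rng, sigma=0.5):
    """Perturb each gene with probability p_mut (Gaussian step is an assumption)."""
    return [g + rng.gauss(0.0, sigma) if rng.random() < p_mut else g
            for g in chrom]

def run_ga(fitness, n_genes, pop_size=50, generations=100,
           p_cross=0.9, p_mut=0.1, k=3, seed=0):
    rng = random.Random(seed)
    # Initial (random) population of pop_size chromosomes.
    population = [[rng.uniform(-3.0, 3.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):          # stopping criterion: generation count
        offspring = []
        while len(offspring) < pop_size:  # m offspring, with m equal to n
            p1 = tournament(population, fitness, k, rng)
            p2 = tournament(population, fitness, k, rng)
            if rng.random() < p_cross:
                c1, c2 = single_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            offspring.append(mutate(c1, p_mut, rng))
            offspring.append(mutate(c2, p_mut, rng))
        population = offspring[:pop_size]  # new population becomes the old one
        best = max(population + [best], key=fitness)
    return best

def toy_fitness(chrom):
    """Hypothetical fitness: highest when every gene equals 1.0."""
    return -sum((g - 1.0) ** 2 for g in chrom)

if __name__ == "__main__":
    print(run_ga(toy_fitness, n_genes=4))
```

In the hybrid approach, the toy fitness function would be replaced by one of the rule-based fitness functions discussed in the next section.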

The hybrid approach presented in this research
employs a standard single-point crossover
operator, a varying mutation operator, and a
standard tournament selection operator. These
operators ensure a balance between exploration
and exploitation for the GA search mechanism. The
encoding scheme consists of a gene representing
each parameter. Thus, the length of the
chromosome is equal to the number of parameters.
The phenotype representations of the chromosome
are the actual values. The genotype representation
is a z-transform of the original values. The
encoding scheme for categorical parameters is
shown in Appendix A.

Fitness Function 

The fitness function is computed by matching 
the rule sets generated from mining data subsets 
(Figure 1). The rules define a set of feasible regions 
for each parameter, which provides the founda- 
tion of the control signature. The fitness function 
locates the rule set intersection, thus, optimizing 
the multirule set fitness function. For example, 
consider the following three rules: 

• Rule 1: IF A < 1.2 AND B = 4 AND C = S 
THEN Decision = 1 

• Rule 2: IF A > 0.7 AND D = 6 THEN Deci- 
sion = 1 

• Rule 3: IF A < 1.5 AND E = 8 AND C = S 
THEN Decision = 1 

The range of values of parameter A satisfying
all the three rules is graphically represented in
Figure 3. The rule set intersection for parameter
A is between 0.8 and 1.2. The fitness function
is evaluated for the rule set intersections of all
parameters involved in forming the control
signature. The fitness functions discussed next
involve three decision intervals. These functions
could be extended to include as many intervals
as required.

Figure 3. Minimum rule set intersection for parameter A

A fitness function depends on the nature of
the problem at hand. The first type of fitness
function is the rule match (GARM). The goal
is to match as many rules as possible within the
desired GA interval (outcome) (see equations
1-6). If the GA solution matches a rule in a
given interval, it is awarded a score of 1;
otherwise, it is awarded 0. The GA solution can
match rules from different intervals, and the
scores for each interval are computed separately.
The scores of the other intervals are then
subtracted from the score of the desired GA
interval. This rewards matching rules of the
desired interval and penalizes matching rules of
the other intervals. GARM also tries to promote
GA solutions with minimum contradiction within
intervals. A high penalty is assigned to a GA
solution that does not satisfy a single rule from
any interval. The ultimate objective is to
maximize the number of rules matched.

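\[ \text{Score\_count1} = \sum_{i} (\text{Number of rules matched})_{i}, \quad i:\ \text{Rule set, Interval 1} \tag{1} \]

\[ \text{Score\_count2} = \sum_{i} (\text{Number of rules matched})_{i}, \quad i:\ \text{Rule set, Interval 2} \tag{2} \]

\[ \text{Score\_count3} = \sum_{i} (\text{Number of rules matched})_{i}, \quad i:\ \text{Rule set, Interval 3} \tag{3} \]

\[ \text{Score\_count} = \text{Score\_count1} - \text{Score\_count2} - \text{Score\_count3} \quad \text{for Interval 1} \tag{4} \]

\[ \text{Score\_count} = \text{Score\_count2} - \text{Score\_count1} - \text{Score\_count3} \quad \text{for Interval 2} \tag{5} \]

\[ \text{Score\_count} = \text{Score\_count3} - \text{Score\_count1} - \text{Score\_count2} \quad \text{for Interval 3} \tag{6} \]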
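
A minimal Python sketch of the GARM score in equations (1)-(6) is given below. The representation of a rule as a predicate on a dictionary of parameter values, the example rules for Intervals 2 and 3, and the function names are assumptions made for illustration. The additional penalty for a solution that matches no rule at all is omitted because its magnitude is not specified here.

```python
from typing import Callable, Dict, List

Solution = Dict[str, float]
Rule = Callable[[Solution], bool]  # True when the solution satisfies the rule

def count_matches(solution: Solution, rule_sets: List[List[Rule]]) -> int:
    """Equations (1)-(3): rules matched, summed over the rule sets of one interval."""
    return sum(1 for rules in rule_sets for rule in rules if rule(solution))

def garm_score(solution: Solution,
               rules_by_interval: Dict[int, List[List[Rule]]],
               target: int) -> int:
    """Equations (4)-(6): matches in the target interval minus matches elsewhere."""
    counts = {k: count_matches(solution, sets)
              for k, sets in rules_by_interval.items()}
    return counts[target] - sum(c for k, c in counts.items() if k != target)

# Hypothetical rule sets: two rule sets for Interval 1, one each for 2 and 3.
example_rules = {
    1: [[lambda s: s["A"] < 1.2 and s["B"] == 4],
        [lambda s: s["A"] > 0.7]],
    2: [[lambda s: s["A"] > 2.0]],
    3: [[lambda s: s["B"] == 9]],
}

if __name__ == "__main__":
    candidate = {"A": 1.0, "B": 4}
    for interval in (1, 2, 3):
        print(interval, garm_score(candidate, example_rules, interval))
```

Because the score of every non-target interval is subtracted, a candidate that satisfies rules from several intervals at once is ranked below one that satisfies only rules of the desired interval.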

The GARM fitness function maximizes the
number of rules matched without due
consideration of the strength of the rules. The
weighted score fitness function (GAWS) weights
the matched rules (see equations 7-12). If the
GA solution matches a rule in a given interval,
it is awarded a score of 1 × (scaled rule
strength); otherwise, it is awarded 0. The
individual scores are calculated for each interval
and, depending upon the desired GA optimizing
interval, combined in the same way as in GARM.

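\[ \text{Score\_weighted1} = \sum_{i} \sum_{j} (\text{Rules matched})_{ij} \times (\text{Rule strength})_{ij}, \quad i:\ \text{Rule set},\ j:\ \text{Rule number in a set, Interval 1} \tag{7} \]

\[ \text{Score\_weighted2} = \sum_{i} \sum_{j} (\text{Rules matched})_{ij} \times (\text{Rule strength})_{ij}, \quad i:\ \text{Rule set},\ j:\ \text{Rule number in a set, Interval 2} \tag{8} \]

\[ \text{Score\_weighted3} = \sum_{i} \sum_{j} (\text{Rules matched})_{ij} \times (\text{Rule strength})_{ij}, \quad i:\ \text{Rule set},\ j:\ \text{Rule number in a set, Interval 3} \tag{9} \]

\[ \text{Score\_weighted} = \text{Score\_weighted1} - \text{Score\_weighted2} - \text{Score\_weighted3} \quad \text{for Interval 1} \tag{10} \]

\[ \text{Score\_weighted} = \text{Score\_weighted2} - \text{Score\_weighted1} - \text{Score\_weighted3} \quad \text{for Interval 2} \tag{11} \]

\[ \text{Score\_weighted} = \text{Score\_weighted3} - \text{Score\_weighted1} - \text{Score\_weighted2} \quad \text{for Interval 3} \tag{12} \]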

The third fitness function, weighted scores and
rules (GAWSR), considers both the rule strength
and the number of rules matched by the GA
solution (see equations 13-18). This ensures that
the strongest rules do not dominate the GA
population. The same procedure as for GAWS is
adopted, except that, for each rule set, the GAWS
score is multiplied by the number of matched
rules. Thus, the GA search makes an effort to
satisfy as many rules as possible for each rule
set.

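\[ \text{Score1} = \sum_{i} (\text{Number of rules matched})_{i} \Big( \sum_{j} (\text{Rules matched})_{ij} \times (\text{Rule strength})_{ij} \Big), \quad i:\ \text{Rule set number},\ j:\ \text{Rule number in a set, Interval 1} \tag{13} \]

\[ \text{Score2} = \sum_{i} (\text{Number of rules matched})_{i} \Big( \sum_{j} (\text{Rules matched})_{ij} \times (\text{Rule strength})_{ij} \Big), \quad i:\ \text{Rule set number},\ j:\ \text{Rule number in a set, Interval 2} \tag{14} \]

\[ \text{Score3} = \sum_{i} (\text{Number of rules matched})_{i} \Big( \sum_{j} (\text{Rules matched})_{ij} \times (\text{Rule strength})_{ij} \Big), \quad i:\ \text{Rule set number},\ j:\ \text{Rule number in a set, Interval 3} \tag{15} \]

\[ \text{Score} = \text{Score1} - \text{Score2} - \text{Score3} \quad \text{for Interval 1} \tag{16} \]

\[ \text{Score} = \text{Score2} - \text{Score1} - \text{Score3} \quad \text{for Interval 2} \tag{17} \]

\[ \text{Score} = \text{Score3} - \text{Score1} - \text{Score2} \quad \text{for Interval 3} \tag{18} \]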
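
The weighted variants can be sketched in the same way. The Python fragment below is an illustrative reading of equations (7)-(18), in which each rule is a hypothetical (predicate, scaled strength) pair. The rule thresholds, the strengths, and the function names are assumptions, not values from this research.

```python
def interval_terms(solution, rule_sets):
    """Return the (GAWS, GAWSR) terms of one interval.

    rule_sets: list of rule sets; each rule is (predicate, scaled_strength).
    """
    gaws = 0.0
    gawsr = 0.0
    for rules in rule_sets:
        strengths = [s for predicate, s in rules if predicate(solution)]
        weighted = sum(strengths)            # sum over j of matched * strength
        gaws += weighted                     # equations (7)-(9)
        gawsr += len(strengths) * weighted   # equations (13)-(15)
    return gaws, gawsr

def combine(per_interval, target):
    """Equations (10)-(12) and (16)-(18): target term minus all other terms."""
    return per_interval[target] - sum(
        v for k, v in per_interval.items() if k != target)

# Hypothetical weighted rule sets for three intervals.
weighted_rules = {
    1: [[(lambda s: s["A"] < 1.2, 0.8), (lambda s: s["A"] > 0.7, 0.5)],
        [(lambda s: s["A"] < 1.5, 0.6)]],
    2: [[(lambda s: s["A"] > 2.0, 0.9)]],
    3: [[(lambda s: s["A"] > 0.9, 0.4)]],
}

def gaws_and_gawsr(solution, rules_by_interval, target):
    terms = {k: interval_terms(solution, sets)
             for k, sets in rules_by_interval.items()}
    gaws = combine({k: t[0] for k, t in terms.items()}, target)
    gawsr = combine({k: t[1] for k, t in terms.items()}, target)
    return gaws, gawsr

if __name__ == "__main__":
    print(gaws_and_gawsr({"A": 1.0}, weighted_rules, target=1))
```

Multiplying each rule set's weighted sum by its match count means that a solution matching two moderately strong rules in a set can outscore one matching a single very strong rule, which is the effect GAWSR is intended to have.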


Domain knowledge is incorporated in the fitness
function as a penalty for violating the domain
rules. These domain rules can take the form of
illegal parameter values, prohibited combinations
of parameters, and so forth. Improved
knowledge-based fitness functions enhance both
the ability of the GA search mechanism and the
control intervals. The performance of the GA
depends on the complexity of the problem and the
type of fitness function. Thus, the complexity of
the GAWSR fitness function may lead to lengthy
GA runs.


Control Signature Development 

The generation of control signatures requires
running the GA several times, because each of the
three fitness functions is maximized at least once
for each outcome. The maximization of each
fitness function ensures that a large solution
space is explored. Repeating the GA n times for
each fitness function further enhances the
exploration and exploitation of the solution
space. The next step in developing control
signatures is running the GA for each interval.
This allows for the development of ideal
(desirable) parameter set points across all
intervals of interest. Operating at the ideal
parameter settings will improve the confidence of
achieving the desired outcomes.

The analysis of the trends, graphically and 
statistically between various outcomes, increases 
insight in the process expressed in the data. Each 
feasible setting generated by the GA is stored in a 
database. The volume of data is sufficiently large 
to develop histograms as well as to perform other 
statistical analyses. Plotting the histogram for 
each parameter provides more insights into the 
behavior of the parameters (i.e., critical process 
control parameters). Critical parameters may be 
indicated by unique histograms for each outcome. 
These histograms may display narrow operating 
conditions related to the desired outcome. Non- 
critical parameters may have similar histograms 
for each outcome. This indicates that regardless 
of outcome, the possible settings obtained through 
the GA are the same for that parameter. 

Table 1 shows the parameter values that
maximize each of the three fitness functions. In
this example, each fitness function was maximized
twice at every interval. The final step in deriving
control signatures is determining the minimum,
maximum, and average for each parameter over all
intervals. Confidence intervals and other statistics
could also be used. The application will dictate
the appropriate tool. Example control signatures
were computed using the data in Table 1 for both
parameters, as shown in Table 2. The minimum
and maximum values define the range of
acceptable values for each parameter.
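
The derivation of Table 2 from Table 1 can be reproduced with a few lines of Python. The sketch below copies the run values from Table 1. The dictionary layout and the function name `control_signature` are assumptions made for illustration.

```python
from statistics import mean

# (Parameter 1, Parameter 2) per GA run, copied from Table 1 in the order
# GARM 1, GARM 2, GAWS 1, GAWS 2, GAWSR 1, GAWSR 2.
ga_runs = {
    "Interval 1": [(2.6, 10.1), (3.1, 11.8), (2.7, 9.3),
                   (2.8, 9.8), (3.2, 11.6), (2.6, 11.5)],
    "Interval 2": [(4.9, 22.6), (5.1, 23.0), (4.8, 24.1),
                   (4.6, 24.0), (6.2, 22.8), (6.1, 22.9)],
}

def control_signature(values):
    """Minimum, average (one decimal place, as in Table 2), and maximum."""
    return {"min": min(values),
            "average": round(mean(values), 1),
            "max": max(values)}

# signatures[interval][p] is the control signature of parameter p + 1.
signatures = {
    interval: [control_signature([run[p] for run in runs]) for p in range(2)]
    for interval, runs in ga_runs.items()
}

if __name__ == "__main__":
    for interval, per_parameter in signatures.items():
        for p, signature in enumerate(per_parameter, start=1):
            print(interval, "Parameter", p, signature)
```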

The next section presents the development 
of control signatures for a power plant appli- 
cation. These control signatures are used to 
describe ideal parameter settings for several 
boiler efficiency intervals. 

INDUSTRIAL APPLICATION 

Reduction of production costs and waste is critical
to operations of power plants. Currently,
operators are required to make decisions regarding
the operation of the boiler system based on
multiple pieces of information. This information
is complex and changes in real time. Control
signatures will increase efficiency, lower
operating costs, reduce fuel consumption, and
decrease emission of air pollutants. Some of the
model-based controllers published in the
literature are briefly discussed next.

The topic of intelligent control in power
system applications is an active area of
research.


Table 1. Example output from the genetic algorithm

A. Interval 1

Fitness function                    Run        Parameter 1   Parameter 2
Matched rules (GARM)                GARM 1     2.6           10.1
                                    GARM 2     3.1           11.8
Weighted score (GAWS)               GAWS 1     2.7           9.3
                                    GAWS 2     2.8           9.8
Rules and weighted score (GAWSR)    GAWSR 1    3.2           11.6
                                    GAWSR 2    2.6           11.5

B. Interval 2

Fitness function                    Run        Parameter 1   Parameter 2
Matched rules (GARM)                GARM 1     4.9           22.6
                                    GARM 2     5.1           23
Weighted score (GAWS)               GAWS 1     4.8           24.1
                                    GAWS 2     4.6           24
Rules and weighted score (GAWSR)    GAWSR 1    6.2           22.8
                                    GAWSR 2    6.1           22.9


Table 2. Control signatures for parameters 1 and 2

A. Control signature for parameter 1

Range      Interval 1   Interval 2
Min        2.6          4.6
Average    2.8          5.3
Max        3.2          6.2

B. Control signature for parameter 2

Range      Interval 1   Interval 2
Min        9.3          22.6
Average    10.7         23.2
Max        11.8         24.1
Chong et al. (Chong, Wilcox, & Ward, 2000)
applied a neural network to represent the
formation of pollutant emissions resulting from
the combustion of coal. The resultant "black-box"
models of the pollutant emissions, namely the
nitrogen oxides and carbon monoxide emissions,
represented the dynamics of the process and
delivered reasonably accurate estimates over a
wide range of test data. The authors pointed out
that the neural network model, although lacking
in model transparency, produces estimates of the
derivatives of combustion with acceptable
accuracy, relative to the simplicity of the model
design.

Ghezelayagh and Lee (2002) proposed an intel- 
ligent predictive controller for a fossil fuel power 
unit. This controller was based on a self-organized 
neurofuzzy identifier to predict the response of 
the plant. The control inputs were optimized 
by an evolutionary programming algorithm to 
minimize the error of the identifier outputs and 
reference set points. 

Stephan et al. (Stephan, Debes, Gross, Win- 
trich, & Wintrich, 2001) presented a control 
scheme based on reinforcement learning for 
an industrial hard-coal combustion process in 
a power plant. They minimized the nitrogen 
oxides emission to comply with the tightening 
environmental protection requirements, while 
keeping other process parameters within speci- 
fied limits. They demonstrated that the proposed 
multiagent system significantly reduced the over- 
all air consumption of the combustion process at 
the power plant. 

Booth and Roland (1991) applied a neural 
network-based system to several types of combus- 
tion facilities. The neural network-based system 
optimized the boiler operation by accommodating 
equipment performance changes due to wear and 
maintenance activities, adjusting to fluctuations in 
fuel quality, and improving operating flexibility. 
The system dynamically adjusted combustion set
points and bias settings in closed-loop supervisory
control to reduce NOx emissions and improve
heat rate simultaneously.

Li, Thompson, and Peng (2002) presented a
method for building an artificial neural network
(ANN) based on a GA. The proposed ANN was
used to predict NO2 emissions in a coal-burning
power plant. This hybrid approach was effective
in building an ANN model that was highly
accurate in predicting NO2 emissions.

Genetic algorithms have also been applied to 
optimize power flow problems (Bakirtzis, Biskas, 
Zoumas, & Petridis, 2002), design boiler system 
controls (Dimeo & Lee, 1995) and generation 
expansion (Park, Park, Won, & Lee, 2000) with 
varying degrees of success. 

The approaches published in the literature
may provide improved control settings; however,
the decision-making process is indiscernible, or
a "black box." The hybrid approach presented
in this research is novel due to the utilization of
explicit rules.

To demonstrate the methods outlined in 
the chapter, control signatures were developed 
for a circulating fluidized boiler (CFB) at the 
University of Iowa Power Plant. These control 
signatures define ideal parameter values for 
three boiler efficiency intervals (84-86, 86-88, 
and 88-90) measured in percentages (Table 4). 
These three intervals were selected due to the 
fact they comprised a significant portion of the 
observations. There was not sufficient data to 
define control signatures for the higher efficiency 
zones (i.e., efficiencies greater than 90%). This 
could be attributed to the quality of the coal used 
for combustion, the environmental conditions, 
the skill and consistency of the operators, and so 
on. As the efficiency of the process increases, the 
higher efficiency data will be collected and used 
to develop future control signatures. 
For this research, efficiency was defined
as the theoretical energy output divided by 
the input (equation (19)). 


\[ \text{Efficiency (\%)} = 100 \times \frac{H_{steam} \times F_{steam}}{B_{coal} \times F_{coal}} \tag{19} \]

where:

B_coal  = Average btu/lb of fuel
F_coal  = Total number of lbs of fuel fed into boiler
H_steam = Enthalpy of steam
F_steam = Total lbs of steam produced by boiler

The average btu/lb of fuel was obtained from 
historical data and was used as a constant for the 
computation of the efficiency. 
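
As a small worked illustration of this definition (energy carried by the steam divided by energy supplied by the fuel), the following Python function computes the efficiency in percent. The function name, the argument names, and the numerical values in the example are hypothetical and are not plant data.

```python
def boiler_efficiency(steam_enthalpy, lbs_steam, btu_per_lb_fuel, lbs_fuel):
    """Efficiency (%) = 100 * (steam energy out) / (fuel energy in).

    steam_enthalpy and btu_per_lb_fuel are in btu/lb; lbs_steam and lbs_fuel
    are the totals over the same time window.
    """
    energy_in = btu_per_lb_fuel * lbs_fuel
    if energy_in <= 0:
        raise ValueError("fuel energy input must be positive")
    return 100.0 * (steam_enthalpy * lbs_steam) / energy_in

if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the formula.
    print(round(boiler_efficiency(1200.0, 100_000.0, 11_000.0, 12_500.0), 2))
```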

Data on 13 parameters selected by a domain
expert was collected in 1-minute intervals. In
this research, a dataset of over 10,000
observations was considered. The list of the
parameters is shown in Table 3.

During the preprocessing phase, all
observations collected during the daily
calibration of the instruments and gauges were
removed. Based on computational experimentation,
the original data was transformed using a
20-minute moving average. This reduced the noise
of the parameters and improved the prediction
accuracy of the knowledge produced by the
data-mining algorithm. Experiments have also led
to the discretization of the efficiency into 2%
intervals (see Table 4). Discretizing efficiency
into crisp intervals will likely increase the
error rate of the classifier and result in some
reduction of information. That is to say, there
is only a small difference between 85.95% and
86.01%, but these efficiencies will be assigned to
two distinct intervals (i.e., 84-86 and 86-88).
Utilizing fuzzy logic could overcome this concern.
This research


Table 3. List of parameters

Parameter List

Boiler Master        Average Middle Bed
Air Master           Average Lower Bed
Biomass Feed Rate    Bed Pressure Median
Furnace Draft        Air Fuel Ratio
Average O2           Ratio SA / PA
SA Fan Flow          DTemp
PA Fan Flow



Table 4. Discretized values of efficiency

Category   Efficiency                Category   Efficiency
L_80       Efficiency < 80%          90_92      90% < Efficiency < 92%
80_82      80% < Efficiency < 82%    92_94      92% < Efficiency < 94%
82_84      82% < Efficiency < 84%    94_96      94% < Efficiency < 96%
84_86      84% < Efficiency < 86%    96_98      96% < Efficiency < 98%
86_88      86% < Efficiency < 88%    G_98       98% < Efficiency
88_90      88% < Efficiency < 90%
focuses on defining control settings leading to a
relative improvement of boiler efficiency, making
the potential loss of information acceptable. The
proposed interval approach is also supported by a
measurement error of some parameters that could
be significant and varying in time.
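
The two preprocessing steps described above, smoothing with a 20-minute moving average and discretizing efficiency into 2% categories, can be sketched as follows. The use of a trailing (rather than centered) window and the treatment of each lower bound as inclusive are assumptions, since neither detail is stated here. The function names are hypothetical.

```python
from collections import deque

def moving_average(values, window=20):
    """Trailing moving average; emits one value per full window.

    With 1-minute observations, window=20 gives a 20-minute average.
    """
    buffer = deque()
    total = 0.0
    smoothed = []
    for v in values:
        buffer.append(v)
        total += v
        if len(buffer) > window:
            total -= buffer.popleft()
        if len(buffer) == window:
            smoothed.append(total / window)
    return smoothed

def efficiency_category(efficiency):
    """Map an efficiency (%) to a Table 4 label such as '84_86'.

    Assumption: each 2% interval includes its lower bound.
    """
    if efficiency < 80:
        return "L_80"
    if efficiency >= 98:
        return "G_98"
    low = int(efficiency // 2) * 2
    return f"{low}_{low + 2}"

if __name__ == "__main__":
    print(moving_average([84.0, 85.0, 86.0, 87.0], window=2))
    print(efficiency_category(85.95), efficiency_category(86.01))
```

The last line illustrates the boundary effect discussed above: two nearly identical efficiencies fall into different categories.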

Data Mining 

To generate multiple rule sets, the initial data set
was partitioned into two data subsets (Figure 1).
Each data subset consisted of at least 2,500
observations. To further increase the number of
rule sets, different partial-parameter subsets
were created for each data subset. Each data
subset contained a unique combination of
parameters. This created seven knowledge bases
for the data-mining algorithm (C4.5, Quinlan,
1986) to explore. At least eight parameters were
used in every application of the algorithm, and
all applications achieved a classification
accuracy greater than 90% in 10-fold cross
validation (Stone, 1974).

The 10 rules with the highest strength (Mitra, 
Pal, & Mitra, 2002) that described the three ef- 
ficiency intervals were selected from each rule 
set. The selected rules were then incorporated 


Figure 4. Coding scheme

Bit position   B1              B2           B3               ...   B11          B12           B13
Features       Boiler Master   Air Master   Air Fuel Ratio   ...   Average O2   SA Fan Flow   Biomass Feed Rate
Phenotype      109.80          102.97       10.28            ...   5.39         57.01         13.27
Genotype       2.74            2.00         -1.12            ...   2.66         0.21          2.93


Figure 5. GA input screen 
into the fitness functions discussed in Section 2.7.
This amounted to 70 rules that described each of
the three intervals. In this application, Score1
corresponds to the efficiency interval 84-86,
Score2 to the interval 86-88, and Score3 to the
interval 88-90.

All the parameters consisted of continuous real 
values and were standardized using the z-trans- 
formation. A sample of the 13-bit chromosome 
is shown in Figure 4. 
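
The relationship between the phenotype and genotype rows of Figure 4 is the usual z-transformation, z = (x - mean) / standard deviation. A minimal Python sketch is shown below. The sample readings are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
from statistics import mean, pstdev

def fit_z(values):
    """Return the mean and (population) standard deviation of one parameter."""
    return mean(values), pstdev(values)

def encode(x, mu, sigma):
    """Phenotype (actual value) -> genotype (z-score)."""
    return (x - mu) / sigma

def decode(z, mu, sigma):
    """Genotype (z-score) -> phenotype (actual value)."""
    return mu + z * sigma

if __name__ == "__main__":
    # Hypothetical historical readings of a single parameter.
    readings = [4.0, 5.0, 6.0]
    mu, sigma = fit_z(readings)
    z = encode(6.0, mu, sigma)
    print(z, decode(z, mu, sigma))
```

Because every gene is expressed on the same standardized scale, crossover and mutation can operate on the genotype without parameter-specific handling, and `decode` recovers the setting to be reported to the operator.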

Control Signature Generation 

Standard settings were applied at each iteration 
of the GA. The GA used a single-point crossover 
with a probability of 0.9 and an initial mutation rate 
of 0.1. The mutation rate was set to incrementally 
increase up to 0.4 after 10 generations of obtaining 
the same solution. The population size was set to 
500, and a tournament selection of 25 observations 
was employed to select chromosomes for breeding. 
The user interface is shown in Figure 5. 
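
One possible reading of the varying mutation rate is sketched below in Python: the rate stays at 0.1 until the best solution has not changed for 10 generations, then rises in small steps until it reaches 0.4. The step size of 0.05, the reset to 0.1 when the best solution changes, and the function name are assumptions, since these details are not given here.

```python
def next_mutation_rate(rate, stalled, start=0.1, cap=0.4, step=0.05,
                       patience=10):
    """Return the mutation rate to use in the next generation.

    stalled: consecutive generations in which the best solution was unchanged.
    """
    if stalled == 0:
        return start  # best solution changed: reset (assumption)
    if stalled >= patience:
        # Raise the rate gradually, never above the cap.
        return min(cap, round(rate + step, 10))
    return rate

if __name__ == "__main__":
    rate = 0.1
    for stalled in range(1, 21):
        rate = next_mutation_rate(rate, stalled)
        print(stalled, rate)
```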

To identify control signatures, the GA was run 
nine times for each efficiency interval. The nine 
trials consisted of three replications for


Table 6. Control signature for Average O2

Average O2    Interval
              84-86    86-88    88-90
Min           4.530    3.639    3.567
Average       4.860    4.192    3.825
Max           5.484    4.420    4.084


each of the three unique fitness functions. Table
5 displays the results from the nine GA trials for
the parameter Average O2.

The control signature was determined by
computing the minimum, maximum, and average
for each efficiency interval. Table 6 presents the
control signature obtained for Average O2.
Control signatures were obtained for all of the
13 parameters.

Computational Results 

The developed control signatures identify
trends in parameter settings that contribute to


Table 5. GA run results for Average O2

Average O2                                     Interval
Fitness function                    Trial      84-86    86-88    88-90
Matched rules (GARM)                GARM 1     5.386    4.290    4.084
                                    GARM 2     5.436    4.295    3.797
                                    GARM 3     5.484    4.404    3.751
Weighted score (GAWS)               GAWS 1     4.550    4.310    3.606
                                    GAWS 2     4.530    4.420    3.788
                                    GAWS 3     4.550    4.420    4.070
Rules and weighted score (GAWSR)    GAWSR 1    4.641    3.639    4.047
                                    GAWSR 2    4.550    4.290    3.711
                                    GAWSR 3    4.610    3.664    3.567
Figure 6. DTemp control signature

Figure 7. Average O2 control signature
improved understanding of the boiler process.
Figure 6 depicts the control signature settings for
the parameter DTemp (difference in temperature
between the middle and lower boiler chambers).
The error bars represent the min and max values
defined by the control signatures. The graph
clearly shows a trend that DTemp should
decrease to obtain increased efficiencies. It is
also evident that as the efficiency interval
increases, the range of acceptable settings for
DTemp also increases.

The graph illustrating the Average O2 control
signature is shown in Figure 7. The graph
indicates that as the setting of Average O2
decreases, the efficiency interval increases.

The parameter settings defined by the control
signature, coupled with the insights gained from
the analysis of the parameter trends, lead to
increased boiler efficiency. To gain further
understanding of the parameter settings,
histograms were developed for all feasible
solutions obtained by the GA. A feasible solution
is defined as an observation that matches at least
one rule for the desired efficiency interval and
does not match any rules related to other
intervals. A histogram was then constructed on
all the feasible solutions for a given efficiency
interval. The three histograms (Figure 8, Figure 9,
and Figure 10) for the parameter Air Fuel Ratio
are shown next.

The histograms in Figures 8 through 10
demonstrate that each interval has a clear, unique,
and identifiable peak value, where the majority
of feasible solutions occur. The identified peaks
led to the development of tighter bounds for the
Air Fuel Ratio. The bounds obtained from the
histograms define three mutually exclusive
operating intervals (Figure 11). Operating in these
intervals increases confidence that the desired
efficiency interval will be obtained.
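
The feasibility test and the histogram construction described above can be sketched in Python as follows. The rule thresholds, the parameter key `afr` (Air Fuel Ratio), and the bin width are hypothetical and serve only to illustrate the definition of a feasible solution.

```python
from collections import Counter

def is_feasible(solution, rules_by_interval, target):
    """True if the solution matches at least one rule of the target interval
    and no rule of any other interval."""
    hits_target = any(rule(solution) for rule in rules_by_interval[target])
    hits_other = any(rule(solution)
                     for interval, rules in rules_by_interval.items()
                     if interval != target
                     for rule in rules)
    return hits_target and not hits_other

def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    return Counter(round((v // bin_width) * bin_width, 10) for v in values)

# Hypothetical rules on the Air Fuel Ratio ("afr") for three intervals.
afr_rules = {
    "84-86": [lambda s: s["afr"] < 10.4],
    "86-88": [lambda s: 10.4 <= s["afr"] < 10.9],
    "88-90": [lambda s: s["afr"] >= 10.8],
}

if __name__ == "__main__":
    ga_solutions = [{"afr": a} for a in (10.28, 10.3, 10.5, 10.85, 11.2)]
    feasible = [s["afr"] for s in ga_solutions
                if is_feasible(s, afr_rules, "84-86")]
    print(histogram(feasible, bin_width=0.5))
```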


Figure 8. Air Fuel Ratio histogram: Interval 84-86

Figure 9. Air Fuel Ratio histogram: Interval 86-88

Figure 10. Air Fuel Ratio histogram: Interval 88-90
Figure 11. Control signatures comparison for Air Fuel Ratio

A. Control signature obtained from GA

Air Fuel Ratio    Interval
                  84-86    86-88    88-90
Min               10.28    10.33    10.78
Max               10.73    10.65    11.25

B. Control signature obtained from histograms

Air Fuel Ratio    Interval
                  84-86    86-88    88-90
Min               10.28    10.54    11.17
Max               10.31    10.58    11.73


The histograms (Figure 12, Figure 13, Figure
14) for the Furnace Draft are shown next. The
parameter illustrated in these histograms provides
another interesting insight.

Each of the three histograms (Figures 12-14)
demonstrates a unique distribution; however, all
share interesting similarities. Each histogram has
approximately the same lower and upper bound,
-0.92 and -0.12, respectively. This fact indicates
that this particular parameter is robust to
changes in all three efficiency intervals. Furnace
Draft can be set to any value within the specified
range and be left static regardless of the desired
efficiency interval. This is significant because
parameters similar to the Furnace Draft require
less control and may result in savings in data
collection and storage.

The three histograms for the parameter Bed 
Pressure Median are illustrated in Figure 15, Fig- 
ure 16, and Figure 17. These figures represent a 


Figure 12. Furnace Draft histogram: Interval 84-86

Figure 13. Furnace Draft histogram: Interval 86-88

Figure 14. Furnace Draft histogram: Interval 88-90





Development of Control Signatures with a Hybrid Data Mining and Genetic Algorithm 


Figure 15. Bed Pressure Median histogram: Interval 84-86

Figure 16. Bed Pressure Median histogram: Interval 86-88

Figure 17. Bed Pressure Median histogram: Interval 88-90


third type of parameter action within the feasible 
solution space. 

The histograms (Figures 15-17) for Bed
Pressure Median show a parameter that reacts
differently depending on the efficiency interval.
In Figure 15, the space of feasible solutions is
robust, and any setting greater than 14.5 results
in the 84-86 efficiency interval. The interval
86-88 (Figure 16) shows a bimodal distribution
with peak values at 19 and 24.5. Settings near
these two peaks are more likely to yield
efficiency in the 86-88 interval. The final
histogram (Figure 17) demonstrates a single
definitive peak similar to the histograms for the
parameter Air Fuel Ratio (Figure 8 through
Figure 10).

The three parameters (Air Fuel Ratio, Furnace
Draft, and Bed Pressure Median) demonstrate
the three main pattern types discovered in this
research. Mutually exclusive operating intervals
were seen in the histograms for the parameter Air
Fuel Ratio. These types of operating conditions
are desirable due to the fact that at each interval,
there are distinct and tight operating ranges. The
second pattern was demonstrated with the
parameter Furnace Draft. The histograms
developed from that parameter display large
robust operating ranges that remained constant
between the efficiency intervals. This can be
advantageous because it allows for selection of
any setting within the operating range. The last
pattern can be seen in the Bed Pressure Median
histograms. The histogram for each efficiency
interval has a different pattern. These types of
parameters may require further investigation.

Testing 

The testing accuracies are different from the
cross-validation/data-mining accuracies. They are
based on different test data sets that were
obtained without the application of control
signatures for process control. The accuracy of
the control signatures was tested by acquiring
data from several days following the initial data
collection. The test data set consisted of over
5,000 observations and was preprocessed in the
same way as the original data set. All the
individual parameter control signatures were
combined to develop a master control signature
for each efficiency interval (Table 7).

Using the logic functions in Excel, all
observations that fell within the ranges defined by
the control signature were extracted. An extracted
observation was considered correct if its
efficiency fell in the interval defined by the
control signature, and the percentage of correct
observations defines the accuracy of each control
signature (see equation 20).

\[ \text{Accuracy} = 100\% \times \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \tag{20} \]
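
The extraction and accuracy computation can also be expressed in a few lines of Python, shown here as an equivalent of the spreadsheet logic rather than the procedure actually used. The two parameter ranges are taken from Table 7. The three test observations, the dictionary keys, and the function names are hypothetical.

```python
def within_signature(observation, signature):
    """True if every parameter in the signature lies inside its [min, max] range."""
    return all(low <= observation[name] <= high
               for name, (low, high) in signature.items())

def signature_accuracy(observations, signature, target_interval):
    """Equation (20): share (%) of extracted observations in the target interval.

    Returns None when no observation falls inside the signature.
    """
    extracted = [o for o in observations if within_signature(o, signature)]
    if not extracted:
        return None
    correct = sum(1 for o in extracted if o["interval"] == target_interval)
    return 100.0 * correct / len(extracted)

# Partial signature for the 84-86 interval (min and max from Table 7).
signature_84_86 = {
    "Air Fuel Ratio": (9.6, 12.1),
    "PA Fan Flow": (69.6, 106.7),
}

if __name__ == "__main__":
    # Hypothetical test observations with their measured efficiency interval.
    test_data = [
        {"Air Fuel Ratio": 10.6, "PA Fan Flow": 88.0, "interval": "84-86"},
        {"Air Fuel Ratio": 11.9, "PA Fan Flow": 70.0, "interval": "86-88"},
        {"Air Fuel Ratio": 12.5, "PA Fan Flow": 90.0, "interval": "84-86"},
    ]
    print(signature_accuracy(test_data, signature_84_86, "84-86"))
```

Using fewer parameters in the signature dictionary corresponds to relaxing the signature: more observations are extracted, at the possible cost of accuracy.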

The number of observations in the test data set
that match the control signature ranges is small;


Table 7. Partial control signature for the efficiency interval: 84-86

           Air Fuel Ratio   Avg Mid Temp   Avg Lower Bed   DTemp    PA Fan Flow
Min        9.6              1521.3         1537.3          -31.5    69.6
Average    10.6             1565.1         1585.3          -20.2    88.3
Max        12.1             1606.4         1626.8          16.0     106.7


Figure 18. Comparison of control signature accuracy (series: statistical signature, GA signature, histogram signature; horizontal axis: efficiency interval)
however, if the observations are within the ranges,
then the accuracy for identifying the appropriate
outcome is high. Due to the tight control signature
ranges defined by the GA, the constraints were
relaxed in order to increase the number of
predictions. The signatures were relaxed by
reducing the number of parameters used in making
the predictions. The relaxed signatures used
between 4 and 8 of the 13 parameters. The
accuracy was computed for the three efficiency
intervals (84-86, 86-88, and 88-90) of interest.

The accuracy was also computed for control 
signatures derived from a classical statistical 
method. The statistical control signature was 
obtained by calculating the min, max, and average 
for each parameter for the three intervals. These 
values were computed using the original boiler 
data set. The accuracy was also calculated from
the histogram analysis. The results are shown
in Figure 18.

The results demonstrate that the histogram 
signature has the highest accuracy regardless 
of the efficiency interval. The statistical control 
signature has the worst performance except for 
the 84-86 interval. The relaxing of the GA control 
signatures may account for a reduction of ac- 
curacy. The true GA control signature accuracy 
should be higher than the relaxed accuracy. 

The results demonstrate that the tighter bounds 
defined through the GA and histogram analysis 
are, on average, more accurate (see Table 8) than 
the statistical control signature. 


Table 8. Average accuracy of control signatures

Control signature        Average accuracy
Statistical signature    31%
GA signature             64%
Histogram signature      83%


CONCLUSION 

A hybrid method combining genetic algorithm and 
data mining for deriving control signatures was 
proposed and successfully applied to improve the 
operations of a power plant boiler. The genetic al- 
gorithm evaluated a multiobjective fitness function 
comprised of rules obtained from several partial 
data sets. A control signature defines the feasible 
region marked by the rule intersections, thus pro- 
viding tighter bounds for operating parameters. 
Operating at these ideal parameter settings will 
improve the confidence of achieving the desired 
outcomes. The hybrid method has the capability 
of visualizing the complex relationships through 
the analysis of trends and the feasible solutions 
generated by the genetic algorithm. 

The application of the control signatures to the 
power plant data has led to the development of 
robust and at times mutually exclusive parameter 
settings. The hybrid method provided valuable in- 
sight regarding the Air Fuel Ratio, Furnace Draft, 
and so on, with respect to the different efficiency 
intervals. The control signatures reduced the pa- 
rameter ranges by 53% (i.e., tightening the bounds) 
while increasing the testing accuracy by 52% 
over the standard statistical method. The control 
signatures increased boiler efficiency, resulting 
in decreased operating costs and the reduction in 
fuel consumption. The method outlined can be 
applied to diverse domains with similar results. 
The research reported in this chapter contributes 
to the development of real-time intelligent control 
systems.

APPENDIX A

Solution-Encoding Scheme 

The GA solution-encoding scheme used in this method places all numeric parameters at the
beginning of a chromosome, followed by the categorical parameters (Figure A1). Each chromosome
is thus split into two parts: a numeric section (N-section) and a categorical section (C-section).
Each gene position represents one parameter, so the length of the chromosome equals the number
of parameters. The phenotype representation of the N-section and C-section is the actual value
of the respective parameter. The genotype representation of the N-section is a z-transform of the
original value, while that of the C-section is based on probability values. The genotype
representation is used in the GA.

Proposed Crossover and Mutation Operators 

For numerical parameters, a z-transform was used to standardize the values. Standardization
eliminated the need to develop special GA operators, as crossover (single or multiple point)
and mutation can be performed without worrying about maintaining the feasibility of the
parameter values. Categorical parameters are placed at the end of the coding scheme, and
two-phase crossover and individualized mutation operators were used.

In the first phase, standard crossover and mutation are performed on the numerical parameters
(Table A1, Figure A2, and Table A2). In the second phase, individualized mutation operators
are applied to each categorical parameter (Table A1, Figure A3, and Table A2).
This scheme requires one crossover probability for the N-section (0.9) and separate mutation
probabilities for the N-section (0.1) and the C-section (0.6). The C-section mutation operator
ensures the feasibility of each parameter and is set at a higher rate to speed up exploration.
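The two-phase scheme can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the Gaussian perturbation used for N-section mutation, the uniform resampling used for C-section mutation, and the four-letter domain of each categorical gene are all assumptions. Only the three probabilities (0.9, 0.1, 0.6) and the parent chromosomes come from the text and Table A1.

```python
import random

N_CROSSOVER_P = 0.9  # N-section crossover probability (from the text)
N_MUTATION_P = 0.1   # N-section mutation probability (from the text)
C_MUTATION_P = 0.6   # C-section mutation probability (from the text)


def z_transform(value, mean, std):
    """Genotype of a numeric gene: the standardized phenotype value."""
    return (value - mean) / std


def inverse_z(z, mean, std):
    """Phenotype of a numeric gene recovered from its genotype."""
    return z * std + mean


def phase_one(parent1, parent2, n_len, rng):
    """Phase I: single-point crossover and mutation on the N-section only."""
    child1, child2 = list(parent1), list(parent2)
    if rng.random() < N_CROSSOVER_P:
        cut = rng.randint(1, n_len - 1)  # cut point inside the N-section
        child1[cut:n_len], child2[cut:n_len] = child2[cut:n_len], child1[cut:n_len]
    for child in (child1, child2):
        for i in range(n_len):
            if rng.random() < N_MUTATION_P:
                # Assumption: perturb the z-score with standard Gaussian noise.
                child[i] = round(child[i] + rng.gauss(0.0, 1.0), 2)
    return child1, child2


def phase_two(chromosome, n_len, domains, rng):
    """Phase II: each categorical gene is resampled from its own domain,
    so the mutated value is always feasible (assumption: uniform sampling)."""
    child = list(chromosome)
    for j, domain in enumerate(domains):
        if rng.random() < C_MUTATION_P:
            child[n_len + j] = rng.choice(domain)
    return child


rng = random.Random(42)  # fixed seed so the run is repeatable
parent1 = [0.23, 0.58, 1.25, -2.57, 0.89, 1.25, "A", "C", "T", "C"]  # Table A1
parent2 = [0.58, 0.92, 1.75, 0.01, -0.75, 2.25, "T", "T", "A", "C"]  # Table A1
domains = [("A", "C", "G", "T")] * 4  # hypothetical domain for each C gene

child1, child2 = phase_one(parent1, parent2, 6, rng)
child1 = phase_two(child1, 6, domains, rng)
child2 = phase_two(child2, 6, domains, rng)
print(child1)
print(child2)
```

Because the numeric genes are z-scores, any real value produced by crossover or mutation maps back to a phenotype through `inverse_z`, which is why no repair step appears in Phase I.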


Figure A1. Coding scheme

              Numeric parameter                                           Categorical parameter
Position      N1             N2          N3              N4               C1      C2      C3
Parameters    Boiler Master  Air Master  Air Fuel Ratio  Avg Mid Temp     Test 1  Test 2  Test 3
Phenotype     109.803        102.971     10.28           1524.029         A       C       T
Genotype      2.737          1.996       -1.116          -2.212           A       C       T


Table A1. Parent chromosomes

                N1     N2     N3     N4     N5     N6     C1   C2   C3   C4
Chromosome 1    0.23   0.58   1.25   -2.57  0.89   1.25   A    C    T    C
Chromosome 2    0.58   0.92   1.75   0.01   -0.75  2.25   T    T    A    C


Figure A2. N-section crossover and mutation

Phase I

Crossover
Parameter       N1     N2     N3     N4     N5     N6
Chromosome 1    0.23   0.58   1.25   -2.57  0.89   1.25
Chromosome 2    0.58   0.92   1.75   0.01   -0.75  2.25

Post crossover
Chromosome 1    0.23   0.58   1.25   0.01   -0.75  2.25
Chromosome 2    0.58   0.92   1.75   -2.57  0.89   1.25

Mutation
Parameter       N1     N2     N3     N4     N5     N6
Chromosome 1    0.23   0.58   1.25   0.01   -0.75  2.25
Chromosome 2    0.58   0.92   1.75   -2.57  0.89   1.25

Post mutation
Chromosome 1    0.23   0.58   1.25   1.45   -0.75  2.25
Chromosome 2    0.58   0.92   1.75   -2.57  0.89   1.25


Figure A3. C-section mutation

Phase II

Mutation
Parameter       C1   C2   C3   C4
Chromosome 1    A    C    T    C
Chromosome 2    T    T    A    C

Post mutation
Parameter       C1   C2   C3   C4
Chromosome 1    A    T    A    C
Chromosome 2    T    G    A    A


Table A2. Children chromosomes

                N1     N2     N3     N4     N5     N6     C1   C2   C3   C4
Chromosome 1    0.23   0.58   1.25   1.45   -0.75  2.25   A    T    A    C
Chromosome 2    0.58   0.92   1.75   -2.57  0.89   1.25   T    G    A    A


ABSTRACT

The importance of data cleaning and data quality is becoming increasingly clear, as evidenced by the 
surge in software, tools, consulting companies, and seminars addressing data quality issues. In this 
contribution, the authors present and describe how Bayesian computational techniques can be exploited 
for data-cleaning purposes to the extent of reducing the time to clean and understand the data. The pro- 
posed approach relies on the computational device named Bayesian belief network, which is a general 
statistical model that allows the efficient description and treatment of joint probability distributions. 
This work describes the conceptual framework that maps the Bayesian belief network computational 
device to some of the most difficult tasks in data cleaning, namely imputing missing values, completing 
truncated datasets, and outliers detection. The proposed framework is described and supported by a 
set of numerical experiments performed by exploiting the Bayesian belief network programming suite 
named HUGIN. 


INTRODUCTION 

Every data analysis task starts by gathering, characterizing, and cleaning a new,
unfamiliar dataset (Dasu & Johnson, 2003). After this process, the data can be analyzed
and the results delivered. It is well established that the first step is far more
difficult and time consuming than the second. Indeed, data gathering is complicated by
sociological (turf sensitivity) and technological problems (different software and
hardware platforms make transferring and sharing data very difficult). Once the data are
in place, acquiring the metadata (data description and business rules) is another
challenge. Very often the metadata are poorly documented, and by the time we are ready to
analyze the data, their quality is suspect. Fortunately, automated techniques can be
applied to understand the data through exploratory data mining, and to ensure data
quality through data cleaning.

Data cleaning and quality monitoring is a continuous activity, running from the
data-gathering stage through to the final choice of analysis and the interpretation of
the results. The static, conventional definitions and metrics of data quality need to be
updated to reflect the continuous and flexible nature of the data quality process and of
the metrics required to measure and monitor data quality effectively (Scannapieco,
Missier, & Batini, 2005).

According to a study conducted by The Data Warehouse Institute, commissioned by DataFlux
(The Data Warehouse Institute, 2003), current data quality problems cost U.S. business
more than 600 billion dollars a year. Furthermore, conversations with data mining
practitioners suggest that between 30% and 80% of the time spent on a data-analysis task
goes into cleaning and understanding the data. The importance of data cleaning and data
quality is therefore becoming increasingly clear, as evidenced by the surge in software,
tools, consulting companies, and seminars addressing data quality issues.

A taxonomy of data-quality problems, ad- 
dressed by data cleaning, together with an over- 
view of the main solution approaches has been 
proposed by Rahm and Hai Do (2000). 

Several contributions, devoted to cleaning 
databases containing corrupted data, have been 
proposed in the specialized literature. 


Guyon, Matic, and Vapnik (1996) emphasize 
the link between informative patterns and data 
cleaning, and describe how machine learning 
approaches can be exploited to remove noise 
from a database. 

Schwarm and Wolfman (2000) pointed out that many techniques have attempted to use
learners to predict problems with class values in datasets, with the same approach being
extended to correct errors in data. However, these techniques suffer from several
problems: they can only correct noise in the class attribute, do not fully leverage
dependencies among attributes, and are inappropriate for datasets with no distinguished
class attribute.

As far as the authors know, Schwarm and Wolf- 
man were the first to propose the use of Bayesian 
belief networks for data cleaning. However, the 
approach described through this work significantly 
differs from their approach in the sense that it is 
unsupervised, and therefore it does not require the 
availability of any subset of precleaned instances 
from the database to be cleaned, as required by 
Schwarm and Wolfman (2000). 

Arning, Agrawal, and Raghavan (1996) de- 
scribed a linear time method for detecting devia- 
tions in a database. The authors assume that all 
records should be similar, which may not be true 
in unsupervised learning tasks, and that an entire 
record is either noisy or clean. 

The assumption that entire records are noisy 
or clean is also common in outlier and novelty 
detection (Hampel, Rousseeuw, Ronchetti, & 
Stahel, 1986; Huber, 1981). 

However, as pointed out in Kubika and Moore (2003), a significant downside to looking at
noise on the scale of records is that entire records are thrown out, and useful,
uncorrupted data may be lost. In datasets where almost all records have at least a few
corrupted cells, this may prove disastrous. The approach described in Kubika and Moore
(2003) is particularly interesting, and uses the data to learn a probabilistic model
containing three components: a generative model of the clean records, a generative model
of the noise values, and a probabilistic model of the corruption process.

In this work, the authors introduce and describe 
a conceptual framework that maps the Bayesian 
belief network computational device to some of 
the most difficult tasks of data cleaning, namely 
imputing missing values, completing truncated 
datasets, and outliers detection. Furthermore, 
the conceptual framework is supported by a set 
of numerical experiments performed by means of 
three sample databases. Numerical experiments 
have been performed through the implementation 
of a software prototype that relies on the Bayesian 
belief network programming suite offered by the 
HUGIN software package (Andersen, Olesen, & 
Jensen, 1990). 

The rest of the work is organized as follows. The section titled “Bayesian Belief
Networks” presents the main characteristics of the Bayesian belief network computational
device. The “BBNs for Data Cleaning” section describes, through a sample database, the
conceptual framework for performing data cleaning with the Bayesian belief network
computational device. This section also provides a quantitative comparison between the
performance of the Bayesian belief network device and a commercial tool for outliers
detection, namely the GritBot software package (Rulequest Research). Finally,
conclusions and directions for further research are reported in the last section.

BAYESIAN BELIEF NETWORKS 

Bayesian belief networks (BBNs) (Jensen, 1996; Neapolitan, 1990; Pearl, 1988), which
emerged within a more general framework of Bayesian statistics, are general statistical
models that allow the efficient description and treatment of joint probability
distributions. BBNs proved to be a very useful tool for combining informal expert
knowledge with statistical techniques for distribution evaluation. BBNs are specifically
designed for cases when the vector of random parameters can have considerable dimension
and/or it is difficult to come up with traditional parametric models of the joint
distribution of random parameters. BBNs have been utilized in application domains such
as image processing (Geman & Geman, 1984), medical diagnosis (Spiegelhalter, Dawid,
Lauritzen, & Cowell, 1993), modeling the Internet and the Web (Baldi, Frasconi, & Smyth,
2003), and reliability analysis of integrated circuits manufacturing (Gaivoronski &
Stella, 1998).

A Bayesian belief network (BBN) is a graphi- 
cal model used to describe dependencies in a 
probability distribution function defined over a 
set of variables. Namely, dependencies among 
variables are represented in a graphical fashion, 
and exploited to decompose (factor) the joint 
distribution in terms of conditional independence 
relations defined over subsets of variables. In other 
words, a BBN is a graphical way of represent- 
ing a particular joint distribution factorization. 
Formally, a BBN model $M$ consists of a set of $n$ discrete random variables
$X_1, \ldots, X_n$ and an underlying directed acyclic graph (DAG) $G = (V, E)$, such
that each random variable is uniquely associated with a vertex of the DAG. The BBN model
$M$ is completely specified by means of the DAG $G$ together with a set of conditional
probability tables $P(X_i \mid pa[X_i])$, $i = 1, \ldots, n$, where $pa[X_i]$ denotes the
parents of node $X_i$, that is, the set of variables that directly influence the random
variable $X_i$.

The main characteristic of the BBN model $M$ is that its joint probability distribution,
that is, the joint probability distribution of the random vector $(X_1, \ldots, X_n)$,
can be represented through the factorization of the conditional probabilities for each
random variable $X_i$ according to:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa[X_i]). \qquad (1)$$

In particular, when the random variable $X_i$ has no parents (no directed links oriented
towards the node associated with $X_i$), that is, when the set $pa[X_i]$ is empty, the
conditional probability $P(X_i \mid pa[X_i])$ is simply its marginal probability
$P(X_i)$. As already mentioned, BBNs are often represented in a graphical form that
helps in emphasizing the causal nature of the directed links between the random
variables of the DAG. A classic example of a BBN (Figure 1) is provided by the
“Hypothetical Medical Belief Networks” (Cooper, 1984; Pearl, 1988). This model consists
of five binary (yes, no) random variables (nodes), namely,

• $X_1$, Metastatic Cancer

• $X_2$, Serum Calcium

• $X_3$, Brain Tumor

• $X_4$, Coma

• $X_5$, Severe Headaches

The directed links represent causal relationships between random variables. According to
Figure 1, “Metastatic Cancer” is a direct cause of “Serum Calcium” as well as of “Brain
Tumor,” “Serum Calcium” and “Brain Tumor” are direct causes of “Coma,” and finally,
“Brain Tumor” is a direct cause of “Severe Headaches.”

The BBN in Figure 1 describes the joint probability distribution
$P(X_1, X_2, X_3, X_4, X_5)$ according to equation (1) as follows:

$$P(X_1, X_2, X_3, X_4, X_5) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1)\, P(X_4 \mid X_2, X_3)\, P(X_5 \mid X_3). \qquad (2)$$
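The factorization in equation (2) can be checked numerically with a short Python sketch. The conditional probability values below are hypothetical, chosen only for illustration, and do not come from the chapter; the helper names `bern` and `joint` are likewise assumptions. The sketch evaluates the joint probability of one configuration as a product of five local terms and confirms that the 32 configurations sum to one.

```python
from itertools import product

# Hypothetical CPT values, for illustration only (not from the chapter).
P_X1 = {True: 0.2, False: 0.8}                     # P(X1 = value)
P_X2 = {True: 0.8, False: 0.2}                     # P(X2 = yes | X1)
P_X3 = {True: 0.2, False: 0.05}                    # P(X3 = yes | X1)
P_X4 = {(True, True): 0.8, (True, False): 0.8,
        (False, True): 0.8, (False, False): 0.05}  # P(X4 = yes | X2, X3)
P_X5 = {True: 0.8, False: 0.6}                     # P(X5 = yes | X3)


def bern(p_yes, value):
    """Probability of a binary value given the probability of 'yes'."""
    return p_yes if value else 1.0 - p_yes


def joint(x1, x2, x3, x4, x5):
    """Equation (2): product of the five local conditional probabilities."""
    return (P_X1[x1]
            * bern(P_X2[x1], x2)
            * bern(P_X3[x1], x3)
            * bern(P_X4[(x2, x3)], x4)
            * bern(P_X5[x3], x5))


# Summing over all 2^5 configurations must give 1 for a valid factorization.
total = sum(joint(*states) for states in product([True, False], repeat=5))
print(joint(True, True, True, True, True), total)
```

Note that the five tables hold 1 + 2 + 2 + 4 + 2 = 11 free parameters, against the 31 needed to store the full joint distribution of five binary variables directly.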

As shown through the classic example of a BBN in Figure 1, the power of BBNs is that one
can infer conditional variable dependencies by visually inspecting the DAG and
exploiting the concept of conditional independence (Pearl, 1988). The conditional
independence property of BBNs is very important because it has a direct impact on the
inference task, which consists of deducing the distribution over a particular subset of
random variables given the known states of some other variables in the network. More
precisely, one needs to efficiently calculate a particular conditional or marginal
probability distribution function from the one defined by the net.


Figure 1. Medical Bayesian belief network

Another important feature of BBNs is that many algorithms for learning are available.
There are several levels of learning in BBNs, ranging from learning the entire graph
structure (the edges in the model) to learning the conditional distributions when the
structure is known (Baldi et al., 2003). As a first approximation, four different
situations can be considered, depending on whether the structure of the network is known
or not, and whether the network contains unobserved data, such as hidden variables,
which are completely unobserved in the model. When the structure is known and no hidden
variables are suspected, the problem is relatively simple and consists of estimating the
conditional probabilities $P(X_i \mid pa[X_i])$ from observed frequencies. By contrast,
learning both the structure, that is, the DAG, and the parameters, that is, the
conditional probabilities $P(X_i \mid pa[X_i])$, of a BBN where hidden variables are
suspected can be a prohibitive task. Reasonable computational approaches are available
in the intermediate cases. In particular, when the structure is known but hidden
variables are suspected, the Expectation Maximization (EM) algorithm (Dempster, Laird, &
Rubin, 1977) can be used. When the structure is unknown but no hidden variables are
assumed, a variety of search algorithms can be formulated to search for the structure
(and parameters) that optimize some performance measure. Likelihood-based algorithms
select the model $M$ maximizing the likelihood $P(D \mid M)$, while the Bayesian approach
consists of selecting the model with the maximum posterior probability $P(M \mid D)$
given the data $D$, where we average over parameter uncertainty,

$$P(M \mid D) \propto \int_{\Theta} P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta, \qquad (3)$$

where $\theta$ represents the BBN parameters, that is, the conditional probability
tables, $P(D \mid \theta, M)$ is the likelihood, and $P(\theta \mid M)$ is the prior.

Equation (3) carries an implicit penalty. Because it averages over the parameter space
instead of picking the most likely parameter, more complex models are in effect
penalized for having a higher-dimensional parameter space, and will only “win” if their
predictive power on the training data can overcome the inherent penalty that arises from
having to integrate over a higher-dimensional space.
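The penalty can be made concrete with a small Python sketch for two binary variables X and Y. It is an illustration under stated assumptions, not the procedure used by any particular software package: each conditional probability is given a uniform prior on [0, 1], so the integral in equation (3) has the closed form n_yes! n_no! / (n_yes + n_no + 1)!, and the function names are hypothetical. Model M0 has no edge between X and Y; model M1 adds the edge X -> Y and therefore one extra parameter.

```python
from math import lgamma


def log_marginal(n_yes, n_no):
    """Log of the integral of theta^n_yes * (1 - theta)^n_no over [0, 1],
    i.e. the evidence of one binary table column under a uniform prior."""
    return lgamma(n_yes + 1) + lgamma(n_no + 1) - lgamma(n_yes + n_no + 2)


def count(values):
    """Return (number of True values, number of False values)."""
    n_yes = sum(1 for v in values if v)
    return n_yes, len(values) - n_yes


def log_evidence_independent(data):
    """M0: no edge between X and Y, one parameter for each variable."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    return log_marginal(*count(xs)) + log_marginal(*count(ys))


def log_evidence_dependent(data):
    """M1: edge X -> Y, so Y has one parameter per state of X."""
    xs = [x for x, _ in data]
    ys_when_x_false = [y for x, y in data if not x]
    ys_when_x_true = [y for x, y in data if x]
    return (log_marginal(*count(xs))
            + log_marginal(*count(ys_when_x_false))
            + log_marginal(*count(ys_when_x_true)))


# Y carries no information about X: the simpler model M0 has higher evidence.
independent = [(x, y) for x in (False, True) for y in (False, True)] * 10
# Y always equals X: the extra parameter of M1 pays for itself.
dependent = [(False, False)] * 20 + [(True, True)] * 20

for name, data in (("independent", independent), ("dependent", dependent)):
    print(name, log_evidence_independent(data), log_evidence_dependent(data))
```

On the independent sample both models fit the data equally well at their best parameter values, yet M1 scores lower because its evidence is spread over a larger parameter space; this is the penalty described above.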

Heckerman (1999) provides a full and detailed 
discussion of how the Bayesian estimation ap- 
proach can be exploited to automatically construct 
BBNs from data. 


BBNS FOR DATA CLEANING 

The data-cleaning task deals with several aspects 
of data quality improvement, as described by Dasu 
and Johnson (2003). 

There is no panacea, no single tool that can solve a majority of data quality problems.
Indeed, data quality problems are highly complex and context dependent, requiring
extensive domain knowledge and involving solutions that often need to be chosen case by
case. Therefore, following the data cleaning and data quality taxonomy introduced and
described in Dasu and Johnson (2003), we decided to focus on the following specific
aspects of data cleaning:

• Imputing missing values 

• Completing truncated datasets 

• Outliers detection 

In particular, imputing missing values is the 
process of guessing the values of missing data; 
completing truncated datasets is required when 
observations are dropped from the dataset for 
some reason, for example, customers who spend 
less than a dollar a year might not be included in 
a customer database; finally, outliers detection 
consists of identifying those observations that 
are not in line with the rest of the data. 

Let us now present the database that will be used in the remainder of this work to
illustrate how the BBN computational device can be exploited to solve the three
data-cleaning tasks listed previously.

The available database consists of 1,000 records concerning the Italian
School-University system, where the following fields are recorded:

• Age; ranges from 16 to 27

• Gender; binary variable (Male, Female)

• HighSchool; binary variable (yes, no) indicating whether or not the subject has the
High School degree

• 1st Level Degree; binary variable (yes, no) indicating whether or not the subject has
the 1st Level Degree

• 2nd Level Degree; binary variable (yes, no) indicating whether or not the subject has
the 2nd Level Degree

• PhD; binary variable (yes, no) indicating whether or not the subject has the PhD

The first task that has been performed is the structural learning of the BBN model M
from the available database D. We exploited the Learning Wizard offered by the HUGIN
Researcher 6.3 software package, which maximizes the posterior probability P(M|D) given
the database D according to equation (3). The Level of Significance parameter was set to
0.05 and the structural learning algorithm was set to PC. The resulting BBN model is
depicted in Figure 2, where the right pane shows the learned BBN model, while the left
pane visualizes, for each field, its marginal distribution over the corresponding
support.

Once the BBN model, that is, the DAG and the conditional probability tables, has been
recovered, it is possible to proceed further and implement the three data-cleaning tasks
considered in this work.

The imputing missing values task can be accomplished optimally. Indeed, it is well known
that the Bayesian decision rule is optimal and that the Bayes risk is the best
performance that can be achieved (Duda, Hart, & Stork, 2001). To clarify how the
imputing missing values task is accomplished using the BBN model depicted in Figure 2,
let us consider the following record:


Field               Value
Age                 23
Gender              Female
HighSchool          yes
1st Level Degree    yes
2nd Level Degree    yes
PhD                 ?

where the question mark (?) means that, for the given record, the PhD field value is
missing.

Figure 2. BBN model for the School-University database

The imputing missing values task is accomplished by first entering into the BBN model in
Figure 2, left panel, the available information (evidence), that is, the values of the
available fields for the given record, and then finding the state of the PhD field that
maximizes the posterior probability given the available information. Formally, the
imputing missing values task is accomplished by solving the following optimization
problem,

$$\max_{\mathrm{PhD} \in \{yes,\, no\}} P(\mathrm{PhD} \mid E), \qquad (4)$$

where, in abbreviated form,

E = (Age=23, Gen=Female, HS=yes, 1stLD=yes, 2ndLD=yes)

is the available information, usually named evidence within the Bayesian framework.

Figure 3. Posterior probability P(PhD|E)
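The maximization in equation (4) can be sketched in Python on a toy network. The sketch does not reproduce the HUGIN model of Figure 2: the three-variable structure (Age -> SecondLD, and both Age and SecondLD -> PhD), the two age groups, all probability values, and the function names are hypothetical and serve only to show the mechanics of entering evidence and picking the most probable state.

```python
from itertools import product

# Hypothetical toy network, not the model learned in the chapter.
STATES = {"Age": ("young", "old"),
          "SecondLD": ("yes", "no"),
          "PhD": ("yes", "no")}
P_AGE = {"young": 0.6, "old": 0.4}                      # P(Age)
P_SECOND_YES = {"young": 0.1, "old": 0.7}               # P(SecondLD=yes | Age)
P_PHD_YES = {("young", "yes"): 0.01, ("young", "no"): 0.0,
             ("old", "yes"): 0.6, ("old", "no"): 0.0}   # P(PhD=yes | Age, SecondLD)


def joint(a):
    """Joint probability of a full assignment, factorized as in equation (1)."""
    p = P_AGE[a["Age"]]
    p_second = P_SECOND_YES[a["Age"]]
    p *= p_second if a["SecondLD"] == "yes" else 1.0 - p_second
    p_phd = P_PHD_YES[(a["Age"], a["SecondLD"])]
    p *= p_phd if a["PhD"] == "yes" else 1.0 - p_phd
    return p


def posterior(target, evidence):
    """P(target | evidence) by enumeration over the unobserved variables."""
    hidden = [v for v in STATES if v != target and v not in evidence]
    scores = {}
    for state in STATES[target]:
        total = 0.0
        for combo in product(*(STATES[h] for h in hidden)):
            assignment = dict(evidence)
            assignment.update(zip(hidden, combo))
            assignment[target] = state
            total += joint(assignment)
        scores[state] = total
    norm = sum(scores.values())
    return {state: p / norm for state, p in scores.items()}


def impute(target, evidence):
    """Return the most probable state of the missing field and its posterior."""
    post = posterior(target, evidence)
    return max(post, key=post.get), post


print(impute("PhD", {"Age": "old", "SecondLD": "yes"}))  # evidence entered
print(impute("PhD", {}))                                 # no evidence: prior only
```

Enumeration is exponential in the number of unobserved variables, so it is only workable for toy models; BBN software relies on more efficient propagation algorithms for the same query.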

Solving the optimization problem in equation (4) with the HUGIN Researcher 6.3 software
package suggests replacing the missing value with “PhD=no”. Indeed, the posterior
probability for the state “no” is P(PhD=no|E) = 0.9986, while the posterior probability
for the state “yes” is P(PhD=yes|E) = 0.0014, as shown in Figure 3. Notice that “PhD=no”
would be the most probable replacement irrespective of the available data. Indeed, by
ignoring the data for the given record, we obtain P(PhD=no) = 0.9760 and
P(PhD=yes) = 0.0240, as shown in Figure 4. Therefore, the BBN model apparently gives no
advantage for solving the imputing missing values task. The advantage and power of BBNs
become clear when considering the following record,

Field               Value
Age                 27
Gender              Female
HighSchool          yes
1st Level Degree    yes
2nd Level Degree    yes
PhD                 ?


Indeed, the solution of the optimization problem in equation (4), where

E = (Age=27, Gen=Female, HS=yes, 1stLD=yes, 2ndLD=yes),

suggests the replacement “PhD=yes”, with P(PhD=yes|E) = 0.8421 and
P(PhD=no|E) = 0.1579 (Figure 5), while the most probable replacement irrespective of the
available data is obviously the same as before, that is, “PhD=no”.

A different methodology for the treatment of missing data was introduced by Zaffalon
(2002). This approach does not try to determine the most probable value to replace the
missing observation; instead, the observation is allowed to assume the whole set of
possible values of the variable. It follows that the probability of an event concerning
a variable with missing observations is no longer a single, precise value (point
probability), but assumes a set of values,

$$P \in (\underline{p}, \overline{p}),$$

according to the specific value hypothesized for the missing value. In the
one-dimensional case the probability is then called an imprecise probability, while in a
higher-dimensional case the probability vector ranges in a convex polytope (called a
credal set). Algorithms for computing over credal sets of probabilities exist in the
literature for some specific models (Fagiuoli & Zaffalon, 1998; Zaffalon & Fagiuoli,
2003). They are, in general, more complex than the equivalent algorithms for point
probabilities; their treatment, however, goes beyond the scope of this work.


Figure 4. Prior probability P(PhD)

The second data-cleaning task, namely the completing truncated datasets task, can be
accomplished by exploiting the same principle as the one exploited for the imputing
missing values task. In this case, however, the optimization problem to be solved
involves more than one decision variable. To clarify how the completing truncated
datasets task can be accomplished, let us consider the following record,


Field               Value
Age                 26
Gender              Male
HighSchool          yes
1st Level Degree    yes
2nd Level Degree    ?
PhD                 ?

In this case, the optimization problem to solve is the following,

$$\max_{(\mathrm{2ndLD},\, \mathrm{PhD}) \in \{(yes, yes),\, (yes, no),\, (no, yes),\, (no, no)\}} P(\mathrm{2ndLD}, \mathrm{PhD} \mid E), \qquad (5)$$

where

E = (Age=26, Gen=Male, HS=yes, 1stLD=yes).

Figure 5. Posterior probability P(PhD|E)

The solution of the optimization problem in equation (5) suggests that the available
record has to be completed with the following assignment,

2nd Level Degree=yes and PhD=yes,

whose posterior probability is P(2ndLD=yes, PhD=yes|E) = 0.6556, while the posterior
probabilities for the other configurations are P(2ndLD=no, PhD=no|E) = 0.0015,
P(2ndLD=yes, PhD=no|E) = 0.0029, and P(2ndLD=no, PhD=yes|E) = 0.3400.

Notice that, when the available information E = (Age=26, Gen=Male, HS=yes, 1stLD=yes) is
not exploited, we obtain, for the pair of database fields (2nd Level Degree, PhD), the
probabilities reported in Table 1. The considered record would then be completed,
according to the maximum probability criterion, as follows: 2nd Level Degree=no and
PhD=no.

Therefore, it is evident that the record completed by assuming independence between the
database fields and using their prior probabilities is nonoptimal, leading to the wrong
record completion.
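The comparison can be replayed in a few lines of Python using only numbers reported in this chapter: the prior joint probabilities of Table 1 and the posterior probabilities obtained for equation (5). The variable and function names are illustrative. The sketch also checks that each Table 1 entry equals the product of its two marginals to four decimals, which is consistent with the independence assumption mentioned above.

```python
# Keys are (2nd Level Degree, PhD) pairs.
# Prior joint probabilities, copied from Table 1.
prior = {("no", "no"): 0.6708, ("yes", "no"): 0.3052,
         ("no", "yes"): 0.0165, ("yes", "yes"): 0.0075}

# Posterior probabilities given E = (Age=26, Gen=Male, HS=yes, 1stLD=yes),
# copied from the solution of equation (5).
posterior = {("yes", "yes"): 0.6556, ("no", "no"): 0.0015,
             ("yes", "no"): 0.0029, ("no", "yes"): 0.3400}


def complete(distribution):
    """Maximum probability criterion: pick the most probable assignment."""
    return max(distribution, key=distribution.get)


def marginal(distribution, position, value):
    """Marginal probability of one field (0 = 2nd Level Degree, 1 = PhD)."""
    return sum(p for key, p in distribution.items() if key[position] == value)


print("prior-based completion:    ", complete(prior))
print("posterior-based completion:", complete(posterior))

# Each Table 1 entry matches the product of its marginals to four decimals.
for (second, phd), p in prior.items():
    expected = marginal(prior, 0, second) * marginal(prior, 1, phd)
    print((second, phd), p, round(expected, 4))
```

The marginal P(PhD=no) recovered from Table 1 is 0.9760, the same prior value quoted earlier for Figure 4.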

An alternative way to recover the right probability distribution for the considered pair
of fields (2nd Level Degree, PhD) would be to isolate, from the entire database, those
cases where the remaining fields assume the joint assignment (Age=26, Gen=Male, HS=yes,
1stLD=yes), and then to compute an estimate of the joint probability for the pair of
fields (2nd Level Degree, PhD), similar to the one reported in Table 1. However, when
the number of records to be completed and/or the number of fields in the database is
large, this procedure leads to a computationally intensive set of queries and subsequent
computations. Furthermore, when the number of fields to be completed is greater than the
two shown in our example, it becomes very costly to compute an approximation of the
joint probability similar to the one depicted in Table 1. Indeed, the computational
complexity of the described procedure grows exponentially with the number of fields to
be completed, with the cardinality of their supports as the base.

All these motivations support the interest in the proposed BBN approach to accomplish
the completing truncated datasets task. Indeed, BBNs are capable of efficiently
memorizing the database content, and of efficiently scanning it to recover local
probability distributions.

Finally, let us describe how the BBN computational device can be properly exploited to
accomplish the outliers detection task. Starting from the definition of an outlier, that
is, an observation that is not in line with the rest of the data, it is straightforward
to conclude that BBNs are well suited for accomplishing such a complex task. Indeed,
given a database, and once the structural learning of the BBN model has been performed,
it is possible to efficiently evaluate high-dimensional probability distributions.
Therefore, given the BBN model $M$ for the considered database $D$, the outlier
detection task can be accomplished through the solution of the following optimization
problem,

$$\min P(X_1, \ldots, X_n). \qquad (6)$$

The optimization problem in equation (6) can be solved efficiently by exploiting the BBN
model, that is, by exploiting the factorization of the joint probability
$P(X_1, \ldots, X_n)$ that, according to equation (1), can be written as follows,

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa[X_i]).$$

Thus, the optimization problem in equation 
(6) can be decomposed into a set of easier opti- 
mization problems, that is, a set of optimization 
problems where few decision variables (database 
fields) are involved. A detailed discussion about 
computational algorithms for solving the optimi- 
zation problem on BBNs in equation (6) can be 
found in Gaivoronski and Stella (1998). 
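One simple way to use equation (6) on an existing database is to score every record with its joint probability under the BBN and to inspect the lowest-scoring ones. The Python sketch below illustrates this idea on a toy chain HighSchool -> 1st Level Degree -> 2nd Level Degree -> PhD. The chain structure, the probability values, and the function names are hypothetical; this is not the model of Figure 2, nor the algorithm of Gaivoronski and Stella (1998).

```python
# Hypothetical CPTs for the chain HS -> L1 -> L2 -> PhD (illustration only).
P_HS = 0.9                           # P(HS = yes)
P_L1 = {True: 0.5, False: 0.001}     # P(L1 = yes | HS)
P_L2 = {True: 0.4, False: 0.001}     # P(L2 = yes | L1)
P_PHD = {True: 0.05, False: 0.001}   # P(PhD = yes | L2)


def bern(p_yes, value):
    """Probability of a binary value given the probability of 'yes'."""
    return p_yes if value else 1.0 - p_yes


def joint(record):
    """Joint probability of a record (hs, l1, l2, phd) via equation (1)."""
    hs, l1, l2, phd = record
    return (bern(P_HS, hs)
            * bern(P_L1[hs], l1)
            * bern(P_L2[l1], l2)
            * bern(P_PHD[l2], phd))


def candidate_outliers(records, k):
    """Return the k records with the smallest joint probability."""
    return sorted(records, key=joint)[:k]


records = [
    (True, True, True, False),     # ordinary: all degrees up to the 2nd level
    (True, True, False, False),    # ordinary: 1st level degree only
    (True, False, True, False),    # 2nd level degree without the 1st
    (True, True, False, True),     # PhD without the 2nd level degree
    (False, False, False, False),  # no degree at all
]

for record in candidate_outliers(records, 2):
    print(record, joint(record))
```

Each score is a product of a few small table lookups, so scoring a whole database is cheap even when the joint distribution itself is far too large to tabulate.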

The outliers detection task, that is, the solution of the problem in equation (6) for
the records belonging to the given database and for the BBN depicted in Figure 2, yields
a set of candidate outliers, some of which are reported together with their probability,
that is, $P(X_1, \ldots, X_n)$, in Table 2.


Table 1. Joint probability for the random pair (2nd Level Degree, PhD)

            2nd Level Degree=no    2nd Level Degree=yes
PhD=no      0.6708                 0.3052
PhD=yes     0.0165                 0.0075

For the sake of brevity, the candidate outliers reported in Table 2 are only a fraction
of those identified by means of the BBN model.

The analysis of the candidate outliers in Table 2 allows the following comments.
Candidate outlier 1 is clearly such because it is not possible to obtain the 2nd Level
Degree without the 1st Level Degree. Candidate outliers 2 and 3 are explained by the
fact that it is not possible to have a PhD without the 2nd Level Degree. Furthermore,
for candidate outlier 2 the Age of 19 is also suspect, because it is unlikely that one
can achieve a PhD at the Age of 19. Candidate outlier 4 is suspect because it is
unlikely that one can achieve a 2nd Level Degree at the Age of 20, even though this can
occur. Candidate outlier 5 is suspect for a similar reason: it is unlikely that one can
have a PhD at the Age of 23. Other candidate outliers can be commented on a similar
basis.

To provide a performance comparison between the BBN approach and other techniques for
the outliers detection task, we decided to scan the available database by using a trial
license of the GritBot software package. This software package allows different
filtering levels to detect outliers, but we decided to use its default values, obtaining
the 10 possible outliers reported in Table 3, together with their GritBot relevance
measure, which measures the reliability of the outlier.

All the outliers identified by GritBot appear 
to be such. Indeed, the candidate outliers 1 and 
2 are explained by the fact that it is unlikely to 


Table 2. BBN candidate outliers

N°  Age  Gender  HighSchool  1st Lev Deg  2nd Lev Deg  PhD  Prob.
1   25   Male    Yes         No           Yes          No   0.00002206
2   19   Male    Yes         Yes          No           Yes  0.00004351
3   21   Female  Yes         Yes          No           Yes  0.00004543
4   20   Male    Yes         Yes          Yes          No   0.00004982
5   23   Female  Yes         Yes          Yes          Yes  0.00004992
6   24   Female  Yes         Yes          No           No   0.00005035
7   19   Female  Yes         Yes          Yes          No   0.00005120
8   17   Male    No          No           Yes          No   0.00005179
9   26   Male    No          No           No           Yes  0.00005588
10  18   Male    Yes         Yes          No           Yes  0.00008853
11  18   Female  Yes         Yes          No           Yes  0.00009105
12  24   Female  Yes         Yes          No           No   0.00010012
13  24   Female  Yes         Yes          Yes          Yes  0.00010012
14  24   Male    Yes         Yes          Yes          Yes  0.00010022
15  24   Male    Yes         Yes          No           No   0.00010022
16  22   Male    Yes         Yes          Yes          Yes  0.00014861
17  27   Male    Yes         No           No           No   0.00019471
18  27   Female  Yes         No           No           No   0.00019984
19  25   Female  No          No           No           No   0.00029797
20  19   Female  No          No           No           No   0.00030428

have a 2nd Level Degree at the Age of 20 or 19.
Candidate outlier 3 is explained by the fact that
it is impossible to have the 2nd Level Degree
without the 1st Level Degree and the HighSchool.
Furthermore, the Age of 17 is also suspect.
Candidate outliers 4 and 5 are motivated by the fact
that, within the considered database, it is unlikely
that at the Age of 24 one does not have the 2nd
Level Degree. Candidate outlier 6 is explained by
the fact that it is not possible to have the 2nd Level
Degree without the 1st Level Degree. Candidate
outliers 7, 8, and 9 are explained by the fact that
it is unlikely that, within the considered database,
at the Age of 23 one does not have the 2nd Level
Degree, even though this could happen. Finally,
candidate outlier 10 is such because it is not
possible to have a PhD without having the 2nd
Level Degree. Furthermore, the values for the
fields HighSchool and 1st Level Degree are also
suspect.

To evaluate the capability of the BBN in Figure 2
to accomplish the outlier detection task, we
introduce the following quantities:

• Agreement w.r.t. GritBot: given the candidate
outliers detected by means of the GritBot
software package, it is the ratio of the number of
GritBot candidate outliers that are also identified
by the BBN to the number of GritBot candidate
outliers. This measure ranges from zero, that is,
none of the candidate outliers from GritBot have
been identified by the BBN, to one, that is,
all the candidate outliers from GritBot have
been identified by the BBN.

• Percentage of database cases classified as
outliers: the ratio of the number of candidate
outliers identified by means of the BBN to the
number of cases in the considered database. It
ranges from 0%, that is, no database case is
identified as a candidate outlier, to 100%, that
is, all the database cases are identified by
means of the BBN as candidate outliers.
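
The two quantities defined above can be computed directly from sets of case identifiers. The following Python sketch is only illustrative: the function names and the toy identifiers are assumptions introduced here, not part of the original study.

```python
def agreement_wrt_reference(bbn_outliers, reference_outliers):
    """Fraction of the reference (e.g., GritBot) outliers also flagged by the BBN."""
    reference = set(reference_outliers)
    if not reference:
        raise ValueError("reference outlier set must not be empty")
    return len(reference & set(bbn_outliers)) / len(reference)


def percentage_classified_as_outliers(bbn_outliers, n_cases):
    """Share of database cases flagged by the BBN, expressed in percent."""
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    return 100.0 * len(set(bbn_outliers)) / n_cases


# Hypothetical case identifiers, for illustration only.
bbn_flagged = {3, 8, 15, 42}
gritbot_flagged = {8, 42, 77, 90}
print(agreement_wrt_reference(bbn_flagged, gritbot_flagged))
print(percentage_classified_as_outliers(bbn_flagged, 400))
```

Sweeping the number of cases flagged by the BBN (from the least probable case upward) and recording both quantities at each step would yield a curve of the kind summarized in an agreement table.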

The results obtained for the School-University
database are reported in Table 4 and depicted in
Figure 6.

From Table 4 it is possible to conclude that the
outlier detection task, accomplished by solving
the optimization problem (equation (6)) for the
BBN model depicted in Figure 2, identifies all the
GritBot candidate outliers when 0.7% of the
analyzed database cases are classified as outliers.

However, the GritBot software package does 
not identify all the candidate outliers correctly 
identified by means of the BBN model depicted 
in Figure 2. In particular, the GritBot software 


Table 3. GritBot candidate outliers (default parameters)

N°  Age  Gender  HighSchool  1st Lev Deg  2nd Lev Deg  PhD  Rel.
1   20   Male    Yes         Yes          Yes          No   0.002
2   19   Female  Yes         Yes          Yes          No   0.002
3   17   Male    No          No           Yes          No   0.003
4   24   Female  Yes         Yes          No           No   0.004
5   24   Male    Yes         Yes          No           No   0.004
6   25   Male    Yes         No           Yes          No   0.005
7   23   Female  Yes         Yes          No           No   0.008
8   23   Female  Yes         Yes          No           No   0.008
9   23   Female  Yes         Yes          No           No   0.008
10  26   Male    No          No           No           Yes  0.009

Figure 6. BBN-GritBot agreement: School-University database



Table 4. BBN-GritBot agreement

Agreement w.r.t. GritBot   Percentage of database cases classified as outliers
0.1                        0.01
0.2                        0.04
0.3                        0.07
0.4                        0.08
0.5                        0.09
0.6                        0.12
0.7                        0.15
0.8                        0.67
0.9                        0.68
1.0                        0.69


package does not detect the outliers number 2, 
3, 5, 10, 11, 13, 14, 16, 17, 18, 19, and 20 that are 
listed in Table 2. 


The comparison between the BBN approach
and the GritBot software package has also been
extended to two more databases described in
the specialized literature, that is, the “churn” and
the “hypothyroid” databases, which are made
available together with the evaluation copy of the
GritBot software package.

In particular, the “churn” database consists of
5,000 records and 20 attributes, of which 16 are
continuous, 3 are binary, and 1 can assume 51
possible values. The GritBot package applied to
the “churn” database, when using a filtering level
equal to 25% together with the parameter maximum
conditions equal to 4, identifies 8 possible outliers.
The corresponding BBN-GritBot agreement graph is
reported in Figure 7.

Figure 7 shows that 7 out of the 8 possible outliers
identified by means of the GritBot software package
are also identified by means of the BBN approach
when 6.2% of the “churn” database cases are
classified as outliers. However, 1 out of the 8
candidate outliers identified by means of the GritBot
software package is not evaluated as such by means
of the BBN model unless 13% of the “churn” database
cases are classified as outliers. This problem may be
due to the fact that all the continuous attributes
(16 out of 20) of the “churn” database were
discretized before learning the BBN model.

Figure 7. BBN-GritBot agreement: Churn database

Figure 8. BBN-GritBot agreement: Hypothyroid database

The “hypothyroid” database consists of 3,772 
records or cases; 23 attributes, of which 6 are continu- 
ous; 15 are binary; 1 can assume 6 possible values; 
and 1 can assume 4 possible values. The GritBot 
package applied to the “hypothyroid” database, 
when using the default parameters setting, identifies 
4 possible outliers. The BBN-GritBot agreement 
graph is reported in Figure 8. 

CONCLUSION AND FURTHER 
RESEARCH 

In this work, the authors described how the BBN 
computational device can be efficiently exploited 
for data cleaning and data quality improvement. 
The tasks of imputing missing values, completing
truncated datasets, and detecting outliers have been
addressed and discussed within the framework of
BBNs. The examples described through the section
named “BBNs for Data Cleaning,” as well as the
results of the performed numerical experiments,
emphasize the importance and relevance of the BBN
computational device to efficiently deal with data
cleaning and data quality improvement. Directions
for further research include enlarging the set
of data cleaning and data quality tasks that can be
dealt with by using the BBN computational device.
Furthermore, it would be desirable to enlarge the
application domain of BBNs by exploiting their
expressiveness to allow the efficient treatment of
databases that include continuous attributes as well
as textual data.
ABSTRACT

Data quality is an important factor in building effective classifiers. One way to improve data quality 
is by cleaning labeling noise. Label cleaning can be divided into two stages. The first stage identifies 
samples with suspicious labels. The second stage processes the suspicious samples using some revision 
scheme. This chapter examines three such revision schemes: (1) removal of the suspicious samples,
(2) automatic replacement of the suspicious labels with what the machine believes to be correct, and (3)
escalation of the suspicious samples to a human supervisor for relabeling. Experimental and theoreti- 
cal analyses show that only escalation is effective when the original labeling noise is very large or very 
small. Furthermore, for a wide range of situations, removal is better than automatic replacement. 


INTRODUCTION 

Most pattern recognition systems are built by fit- 
ting a model to a training dataset. The accuracy 
of the resulting classifier is dependent on both the 
choice of model and the quality of the dataset. 


However, based on practical experience, most of 
the standard models (Duda, Hart, & Stork, 2001) 
tend to have comparable accuracy. In fact, simple 
models, such as Nearest Neighbor and Decision 
Trees, have remained very popular among practi- 
tioners because they are easy to work with while 
being almost as accurate as the much more elaborate
models. Some researchers have concluded
that the choice of a model is insignificant when 
one has a large, high-quality training dataset (Ho 
& Baird, 1997). 

Many researchers have therefore examined 
ways to improve the quality of datasets by clean- 
ing up noise in the data. Generally, there are two 
types of noise: attribute noise and labeling noise. 
Attribute noise refers to errors in the attribute 
values of data samples; labeling noise refers to 
errors in the class labels of data samples. In this 
chapter, we will focus on the issue of cleaning 
labeling noise. 

An important factor in building good pat- 
tern classifiers is the quality of training data. A 
leading source of degradation of training data 
is labeling noise. Cleaning such noisy data can 
be divided into two stages. The first stage is the 
identification of possibly mislabeled samples; such
samples are called suspects. The second stage is 
the processing of those suspicious samples. There 
are three major schemes to this revision process: 
(1) removal simply removes the suspects from the 
dataset (Brodley & Friedl, 1996; Brodley & Friedl, 
1999; Zhu, Wu, & Chen, 2003), (2) replacement 
changes the labels of the suspects to what the 
machine believes are the correct classes (Castelli, 
Hutchins, Li, & Turek, 2001; Gimlin & Ferrell, 
1974; Gowda & Krishna, 1979; Shanmugam & 
Breipohl, 1971; Teng, 1999), and (3) escalation 
asks a human supervisor (labeler) to look at the 
samples and provide the correct labels (Guyon, 
Matic, & Vapnik, 1996). 

Choosing among the three revision schemes 
would be a moot issue if the first stage of identi- 
fication is perfect. For such an ideal situation, all 
three schemes would result in noise-free datasets 1 , 
and one would certainly choose to use automatic 
replacement since removal would have created a 
smaller dataset, while escalation has the added ex- 
pense of getting a human supervisor involved. 

In practice, one cannot perfectly identify mis- 
labeled samples automatically. Thus, the choice 


of the proper revision scheme is complicated. 
Mistakes in the identification stage can propagate 
to the revision stage and may even be magnified 
there. For example, removal can erroneously re- 
move a correctly-labeled sample, and replacement 
can erroneously change a correct label. Relabeling 
by a human expert, on the other hand, is much 
more robust, yet it is also more costly. 

Most literature on cleaning mislabeled data 
has focused on the identification stage, while 
the choice of revision scheme was arbitrary or 
based on unique factors. Removal, for example, 
may be chosen if there is a desire to shrink the 
training dataset for reasons of storage or compu- 
tational efficiency. Escalation, for example, may 
be chosen if high accuracy is strongly desired and 
labelers are essentially free (Guyon et al., 1996). 
There has been little systematic comparison of 
the improvement in accuracy from each of the 
revision schemes. This chapter attempts to bring 
some insights on how to choose the right revision 
scheme. 

RELATED WORK 

While labeling noise can occur in almost all 
application areas, it is especially common in 
some domains such as remote sensing (Brodley 
& Friedl, 1999; Smyth, Fayyad, Burl, & Perona, 
1996), medical diagnosis (Dawid & Skene, 1979; 
Gamberger, Lavrac, & Groselj, 1999), and natu- 
ral language processing (Blaheta, 2002; Eskin, 
2000). For remote sensing and medical diagnosis, 
in principle, one can obtain completely accurate 
labels, but in practice it is usually not the case, 
due to the exorbitant cost involved. One has to 
settle for approximate judgments from indirect 
information. 

For natural language processing, several re- 
searchers have looked into domain specific ways to 
mitigate the labeling noise problem. Eskin (2000) 
has examined probabilistic methods for detecting 
anomalous tags in the Penn Treebank corpus. 




A Comparison of Revision Schemes for Cleaning Labeling Noise 


These anomalies are considered mislabels. Blaheta 
(2002) has examined deterministic methods for 
cleaning common types of tagging errors. 

While labeling noise presents an obvious
problem for training, it also complicates the
evaluation. Any automatic cleaning of the testing data
would skew its distribution and bias the evaluation.
Some recent works have addressed issues around 
labeling noise in the test dataset (Lam & Stork, 
2003; Ng, 1997). For example, Lam and Stork 
(2003) have demonstrated that under realistic 
assumptions, the true error rate of a classifier is 
bounded by the apparent error rate, plus or minus 
the mislabeling rate of the test dataset. 

Identifying Mislabeled Data 

There is an extensive literature on identifying
mislabeled data. We review some of it in
this section.

The k-nearest-neighbor algorithm and its vari- 
ants (Instance-Based Learning, Lazy Learning, 
Case-Based Learning, etc.) have received consid- 
erable attention on the issue of label noise cleaning, 
especially by the removal scheme. However, the 
main motivation behind much of that work has 
been to reduce the training set size for storage 
and computational reasons, rather than improving 
accuracy. Removing mislabeled samples is one of 
the logical approaches to shrink the dataset, and 
it is often combined with other approaches, such 
as removing inconsequential samples far from 
the decision boundaries. 

One of the earliest cleaning algorithms is 
Wilson Editing (Wilson, 1972). Under Wilson 
Editing, a sample is removed (i.e., “edited”) if its 
label is different from the labels of the majority 
of its neighbors. 

Aha et al. (1991) introduce the instance-based
learning algorithms IB1, IB2, and IB3. IB3
in particular removes noisy samples. It sequen- 
tially examines each sample in a training set, and 
a sample is kept if it has been misclassified by 
previous samples but contributes significantly to 


the correct classification of later samples (using 
the nearest-neighbor classifier). The idea is to 
remove unimportant samples that do not border 
the decision boundaries, as well as noisy samples 
that do not contribute to correct classification. 

Wilson and Martinez (1997) present three 
reduction techniques: RT1, RT2, and RT3. Their 
techniques involve the concept of associate. A 
sample is considered an associate of x if x is one 
of the sample’s k nearest neighbors, where k is 
generally a small odd integer. Their heuristics 
remove x if its removal does not increase the 
number of its associates being misclassified. 
Again the concept is to remove samples that do 
not make important contributions to the overall 
classification accuracy; some of those samples 
are probably mislabeled. 

Brighton and Mellish (2002) present a review of
removal methods for kNN, including their own
iterative case filtering algorithm (ICF). The ICF 
algorithm first uses Wilson Editing for remov- 
ing suspicious samples. Afterward ICF further 
reduces training set size by eliminating redun- 
dant samples in the interior of decision regions. 
Brighton and Mellish have made the theoretical 
observation that, for nearest neighbor classifica- 
tion, it is always possible to reduce the training 
set to the size of the testing set without affecting 
the classifier’s accuracy on the testing set. 

Brodley and Friedl (1996, 1999) describe a general
identification scheme that does not require the
samples to be in a metric space. The scheme uses
m learning algorithms and n-fold cross-validation
to identify mislabeled training data. That is, the
training data is first divided into n parts. For each
part, the m algorithms are trained on the other n-1
parts, and a sample is considered mislabeled if
m’ or more of the m algorithms misclassify that
sample. Brodley and Friedl call the case m’ = m
consensus filtering and the case m’ = (m + 1)/2
majority filtering.
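
The cross-validated voting procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Brodley and Friedl's implementation: the three toy learners (1-NN, 3-NN, and nearest class mean on one-dimensional features), the interleaved fold assignment, and all function names are hypothetical choices made here.

```python
from collections import Counter


def knn_learner(k):
    """Return a fit function for a k-NN classifier on 1-D features."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
            return Counter(label for _, label in nearest).most_common(1)[0][0]
        return predict
    return fit


def centroid_learner(train):
    """Fit a nearest-class-mean classifier on 1-D features."""
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in groups.items()}
    return lambda x: min(means, key=lambda label: abs(means[label] - x))


def cross_validated_filter(data, learners, n_folds, min_votes):
    """Flag sample i if at least min_votes learners, trained on the
    other folds, misclassify it."""
    suspects = []
    for fold in range(n_folds):
        train = [s for i, s in enumerate(data) if i % n_folds != fold]
        models = [fit(train) for fit in learners]
        for i, (x, label) in enumerate(data):
            if i % n_folds == fold:
                errors = sum(model(x) != label for model in models)
                if errors >= min_votes:
                    suspects.append(i)
    return sorted(suspects)


# Toy data: class 0 near 0..4, class 1 near 10..14; index 2 is mislabeled.
data = [(0.0, 0), (1.0, 0), (2.1, 1), (3.0, 0), (4.0, 0),
        (10.0, 1), (11.0, 1), (12.0, 1), (13.0, 1), (14.0, 1)]
learners = [knn_learner(1), knn_learner(3), centroid_learner]
m = len(learners)
consensus = cross_validated_filter(data, learners, n_folds=5, min_votes=m)
majority = cross_validated_filter(data, learners, n_folds=5, min_votes=(m + 1) // 2)
print(consensus, majority)
```

Consensus filtering is the more conservative of the two settings, since every learner must err before a sample is flagged; lowering the vote threshold flags more samples, including correctly labeled ones that sit next to a mislabeled neighbor.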

Gamberger et al. (1999) developed the saturation
filter. Theoretically, if a training dataset D is
noiseless and saturated (containing enough data
points to find a correct target hypothesis), and D_n is
the union of D and {(x, y)}, where (x, y) is a noisy
data point not correctly classified by the target
hypothesis, then the CLCH (Complexity of the Least
Complex Hypothesis) value g(D) is less than g(D_n).
The saturation filter thus tries to remove data points
whose removal reduces the CLCH value of the dataset.
Gamberger et al. applied the saturation filter to a
medical domain and obtained favorable results.

John (1995) developed ROBUST-C4.5, an exten- 
sion of the C4.5 decision tree induction with built-in 
removal of mislabeled data. ROBUST-C4.5 first 
trains a decision tree, using C4.5 with pruning. It 
then removes all samples in the training set that the 
decision tree misclassifies. After that the remaining 
data is used to train a new decision tree, and the 
process is repeated until all samples are classified
correctly.

Teng (1999) developed a procedure, called pol- 
ishing, that attempts to correct both attribute and 
labeling noise. It is an example of labeling cleaning 
using the replacement scheme (Castelli et al., 2001; 
Gimlin & Ferrell, 1974; Gowda & Krishna, 1979; 
Shanmugam & Breipohl, 1971). Polishing exploits 
interdependencies among attributes as well as 
interdependencies between attributes and target 
class. These interdependencies are used to predict 
and correct attribute and label noise. 

Guyon et al. (1996) is the only published work 
we know of that escalates suspicious samples for 
relabeling by a human expert labeler. They propose 
a cleaning method where a human supervisor checks 
those samples that have the largest information 
criterion and are therefore most “surprising.” While 
most automatic cleaning methods assume surpris- 
ing patterns to be garbage, Guyon et al. argue that 
surprising patterns can also be informative and hu- 
man judgment should be exercised to discriminate 
between the two. 

Many of the cleaning programs discussed have 
trouble scaling up to large datasets. Recently, Zhu et 
al. (2003) developed a scheme called partition filter 
to address the problem of label cleaning in large, 
distributed datasets. 


Unfortunately, with the exception of Aha et 
al. (1991) and Gowda and Krishna (1979), none of 
the filtering experiments we found in the litera- 
ture has tried to characterize the effectiveness of 
filtering under different severity of mislabeling. 
In fact, most experiments have only compared 
a classifier trained on a dataset with the same 
classifier trained on the filtered dataset. It is as- 
sumed that labeling noise exists in the original 
dataset, and the improved accuracy from train- 
ing on the filtered dataset is mainly due to the 
removal of the mislabeled samples. However, 
another legitimate line of argument is that data 
filtering would have improved accuracy even if 
there was no mislabeling. For example, filtering 
may just be a data driven form of regularization 
that prevents overfitting. 

k-k' NEAREST-NEIGHBOR 
IDENTIFICATION 

Many identification schemes exist, including the 
instance-based (IB) algorithms of Aha et al. (1991) 
and the reduction techniques (RT) of Wilson and 
Martinez (1997). The general approach is to find 
unusual samples whose removal does not decrease 
classification accuracy significantly. 

The identification scheme we examine is
the k-k’ nearest-neighbor procedure (Gimlin &
Ferrell, 1974). Consider a sample $x$ with label
$y \in \{\omega_1, \omega_2\}$. Let $k$ be an odd integer and $k'$ be an
integer greater than $k/2$. Of the $k$ nearest neighbors
to $x$, if $k'$ or more of the patterns do not belong
to class $y$, then sample $x$ is considered a suspect.
Note that Wilson’s editing algorithm (Wilson,
1972) is a specialization of this procedure, with
$k' = (k+1)/2$, and filtering as the default revision
scheme.
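
A batch version of the k-k’ identification rule can be sketched as follows. This is a minimal brute-force illustration, assuming Euclidean distance and points stored as tuples; the function name and the toy data are hypothetical and not taken from Gimlin and Ferrell (1974).

```python
from math import dist


def kk_prime_suspects(points, labels, k, k_prime):
    """Return indices of samples for which at least k_prime of the k
    nearest neighbors (excluding the sample itself) carry another label."""
    if not (k / 2 < k_prime <= k):
        raise ValueError("k_prime must satisfy k/2 < k_prime <= k")
    suspects = []
    for i, (point, label) in enumerate(zip(points, labels)):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(points[j], point))
        disagreements = sum(labels[j] != label for j in others[:k])
        if disagreements >= k_prime:
            suspects.append(i)
    return suspects


# Toy data: two well-separated 1-D clusters; index 2 carries the wrong label.
points = [(float(v),) for v in (0, 1, 2, 3, 4, 5, 20, 21, 22, 23, 24, 25)]
labels = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(kk_prime_suspects(points, labels, k=5, k_prime=3))
print(kk_prime_suspects(points, labels, k=5, k_prime=5))
```

Setting `k_prime = (k + 1) // 2` gives the majority rule used by Wilson editing; what happens to the returned suspects (removal, replacement, or escalation) is a separate decision.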

Note that the original k-k’ nearest-neighbor 
procedure studied by Gimlin and Ferrell (1974) 
is an online version of our batch algorithm. That 
is, the procedure receives samples one at a time, 
and it categorizes them as either suspects or not, 
based only on the samples it had seen previously. 
In practice, the choice of online vs. batch cleaning
will generally depend on the learning algorithm, 
which in turn depends on the application domain. 
An online learning algorithm should probably 
use an online cleaning procedure, while a batch 
learning algorithm can use either batch cleaning or 
online cleaning. Batch cleaning will clean better, 
but if storage capacity is a concern and one can 
keep only a subset of training data, then online 
filtering makes sense. This chapter considers only 
the batch procedure, so that the samples’ ordering 
effect will not create unnecessary variance in the 
experiments. 

The parameters k and k’ together specify the
level of confidence necessary before one can
categorize a sample as suspicious. That is, they
determine the level of aggressiveness in cleaning
the data. For example, the pair (k = 5, k’ = 3)
will clean aggressively, whereas the pair
(k = 5, k’ = 5) will process a sample only if the
algorithm is very confident that the sample has
been mislabeled.


THEORETICAL ANALYSIS 

In the following theoretical analysis, we assume a
nearest-neighbor classifier and a training dataset
of infinite size, so that a test sample would have a
nearest neighbor at exactly the same point in the
feature space. We denote this neighbor $x$, and it
has a true but unknown label $y$. The dataset has a
noisy version $\tilde{y}$, which is modeled as a random
flipping (with probability $\varepsilon$) of $y$. Cleaning algorithms
try to substitute $\tilde{y}$ with $\tilde{y}_\dagger$, where the subscript $\dagger$
denotes a cleaned version of $\tilde{y}$. The hope is that
$\tilde{y}_\dagger$ is more likely than $\tilde{y}$ to equal $y$. Note that in
the removal scheme, we technically would have
removed the neighbor and its label. However, in
the limiting case of infinite data samples, the next
closest neighbor will be at the same location. Thus
we still denote the next closest neighbor as $x$, but
its label is now $\tilde{y}_\dagger$.


To simplify notation, we denote $\Pr[y = \omega_i \mid x]$
as just $p(\omega_i \mid x)$. The Bayes’ error rate at point $x$ is
therefore $r^*(x) = \min[p(\omega_1 \mid x), p(\omega_2 \mid x)]$. The error
rate of the nearest-neighbor classifier using true
labels $y$ is,

\[
r_y(x) = p(\omega_1 \mid x)\,p(\omega_2 \mid x) + p(\omega_2 \mid x)\,p(\omega_1 \mid x)
       = 2 r^*(x)\,(1 - r^*(x)), \tag{1}
\]

and this error rate is well known to be less than
$2r^*(x)$ (Cover & Hart, 1967). The noisy labels are
assumed to be distributed according to
$\Pr[\tilde{y} = \omega_i \mid x] = (1 - \varepsilon)\,p(\omega_i \mid x) + \varepsilon\,(1 - p(\omega_i \mid x))$,
where the mislabeling rate $\varepsilon < 0.5$. The error rate
using data with noisy labels $\tilde{y}$ is,

\[
\begin{aligned}
r_{\tilde{y}}(x) &= (1 - \varepsilon)\,r_y(x) + \varepsilon\,(1 - r_y(x)) \\
                 &= r_y(x) + \varepsilon\,(1 - 2 r_y(x)) \\
                 &= \varepsilon + (1 - 2\varepsilon)\,r_y(x).
\end{aligned}
\]

It can be seen from the last two equations that
the error rate from the noisy data is higher than
both $r_y(x)$, the error rate given true labels, and
the mislabeling rate $\varepsilon$. In addition, the error rate
increases linearly with $\varepsilon$.
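
As a quick numerical check of these relations, the short Python sketch below evaluates the asymptotic nearest-neighbor error with and without label noise. The function names are illustrative assumptions; the formulas are the ones stated in the text for the infinite-sample, two-class case.

```python
def nn_error_true_labels(r_star):
    """Asymptotic 1-NN error with true labels: 2 r*(1 - r*)."""
    return 2.0 * r_star * (1.0 - r_star)


def nn_error_noisy_labels(eps, r_star):
    """Asymptotic 1-NN error when labels are flipped with probability eps."""
    r_y = nn_error_true_labels(r_star)
    return eps + (1.0 - 2.0 * eps) * r_y


# Example: Bayes error 0.1 gives r_y = 0.18; adding 10% label noise
# raises the error to 0.1 + 0.8 * 0.18 = 0.244.
for eps in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(eps, round(nn_error_noisy_labels(eps, 0.1), 4))
```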

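The error rate using data with cleaned labels
$\tilde{y}_\dagger$ is derived analogously,

\[
\begin{aligned}
r_{\tilde{y}_\dagger}(x) &= \Pr[\tilde{y}_\dagger = y \mid x]\,r_y(x) + \Pr[\tilde{y}_\dagger \neq y \mid x]\,(1 - r_y(x)) \\
                         &= r_y(x) + \Pr[\tilde{y}_\dagger \neq y \mid x]\,(1 - 2 r_y(x)),
\end{aligned}
\]

and it also increases linearly with $\Pr[\tilde{y}_\dagger \neq y \mid x]$.

We proceed to find the mislabeling rate
$\Pr[\tilde{y}_\dagger \neq y \mid x]$ of the cleaned data under the three
different revision schemes.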

Replacement 


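We denote the binomial distribution as

\[
\mathrm{Bin}(k, i, p) = \binom{k}{i}\,p^{i}\,(1 - p)^{k - i}.
\]

The mislabeling rate with replacement, as derived in
Gimlin and Ferrell (1974), is,

\[
\begin{aligned}
\Pr[\tilde{y}_\dagger \neq y \mid x]
  ={}& p(\omega_1 \mid x) \sum_{i=0}^{k-k'} \mathrm{Bin}\bigl(k, i, \Pr[\tilde{y} = \omega_1 \mid x]\bigr) \\
  &+ p(\omega_2 \mid x) \sum_{i=0}^{k-k'} \mathrm{Bin}\bigl(k, i, \Pr[\tilde{y} = \omega_2 \mid x]\bigr) \\
  &+ \varepsilon \sum_{i=k-k'+1}^{k'-1} \mathrm{Bin}\bigl(k, i, \Pr[\tilde{y} = \omega_1 \mid x]\bigr).
\end{aligned}
\]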


If k’ or more of a sample’s k nearest neighbors
belong to one class, then the sample is forced
to belong to that class also, regardless of the
original value of its label. The first two terms in
the equation represent the mistake of forcing a
label to the wrong value. The last term is due to
not revising an incorrect label.

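Let $f_\varepsilon(x) = (1 - \varepsilon)\,r^*(x) + \varepsilon\,(1 - r^*(x))$. Note that
$f_\varepsilon(x) = \Pr[\tilde{y} = \omega_1 \mid x]$ if $r^*(x) = p(\omega_1 \mid x)$ and
$f_\varepsilon(x) = \Pr[\tilde{y} = \omega_2 \mid x]$ if $r^*(x) = p(\omega_2 \mid x)$. We can
leverage the symmetry of the mislabeling rate
$\Pr[\tilde{y}_\dagger \neq y \mid x]$ to rewrite it in terms of $\varepsilon$ and $r^*(x)$:

\[
\begin{aligned}
\Pr[\tilde{y}_\dagger \neq y \mid x]
  ={}& r^*(x) \sum_{i=0}^{k-k'} \mathrm{Bin}\bigl(k, i, f_\varepsilon(x)\bigr) \\
  &+ (1 - r^*(x)) \sum_{i=0}^{k-k'} \mathrm{Bin}\bigl(k, i, 1 - f_\varepsilon(x)\bigr) \\
  &+ \varepsilon \sum_{i=k-k'+1}^{k'-1} \mathrm{Bin}\bigl(k, i, f_\varepsilon(x)\bigr).
\end{aligned}
\]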
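
The replacement formula expressed in terms of the noise rate and the Bayes’ error rate is easy to evaluate numerically. The Python sketch below is an illustration only: the function and argument names are assumptions, and the summation limits follow the three-term expression given earlier in this section (forced-wrong terms up to k − k’, unrevised term from k − k’ + 1 to k’ − 1).

```python
from math import comb


def binom_pmf(k, i, p):
    """Bin(k, i, p): probability of i successes in k trials."""
    return comb(k, i) * p**i * (1 - p) ** (k - i)


def replacement_mislabel_rate(eps, r_star, k, k_prime):
    """Mislabeling rate after automatic replacement, two-class case."""
    f = (1 - eps) * r_star + eps * (1 - r_star)
    low = range(0, k - k_prime + 1)          # i = 0 .. k - k'
    middle = range(k - k_prime + 1, k_prime)  # i = k - k' + 1 .. k' - 1
    forced_wrong = (r_star * sum(binom_pmf(k, i, f) for i in low)
                    + (1 - r_star) * sum(binom_pmf(k, i, 1 - f) for i in low))
    left_wrong = eps * sum(binom_pmf(k, i, f) for i in middle)
    return forced_wrong + left_wrong


# Compare the cleaned mislabeling rate with the original noise rate eps.
for r_star in (0.0, 0.1, 0.4):
    for eps in (0.01, 0.1, 0.3):
        rate = replacement_mislabel_rate(eps, r_star, k=5, k_prime=3)
        print(r_star, eps, round(rate, 4), "improved" if rate < eps else "degraded")
```

Rows marked "degraded" are cases where the cleaned labels are noisier than the original ones, which occurs when the Bayes’ error rate is high relative to the injected noise.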


Escalation 

Under escalation, a suspicious label is escalated
to a human supervisor for correction. Assuming
the mislabeling rate of this human labeler is
$\varepsilon_h(x)$, the mislabeling rate of the cleaned data is,
(see Box 1).

The first two terms represent the probability
the suspicious labels will be relabeled incorrectly.
The next two terms represent the probability that
a label agrees with k’ or more of its k neighbors,
but is in fact wrong. The last term represents the
probability of mislabeling when fewer than k’
neighbors can agree on a class.

Note that,

\[
\begin{aligned}
\Pr[y = \omega_1, \tilde{y} = \omega_2 \mid x]
  &= \Pr[\tilde{y} = \omega_2 \mid x, y = \omega_1]\,p(\omega_1 \mid x) \\
  &= \Pr[\tilde{y} \neq y \mid x, y = \omega_1]\,p(\omega_1 \mid x) \\
  &= \varepsilon\,p(\omega_1 \mid x),
\end{aligned}
\]

and analogously $\Pr[y = \omega_2, \tilde{y} = \omega_1 \mid x] = \varepsilon\,p(\omega_2 \mid x)$.
Again, recognizing the formal symmetry in this
equation, we can rewrite the mislabeling rate
$\Pr[\tilde{y}_\dagger \neq y \mid x]$ as, (see Box 2).

Removal 

We think of removal in the infinite data scenario as
a form of relabeling. That is, removal is equivalent


Box 1. 
Box 2.



to letting the label of the nearest neighbor (still 
at x) take the original suspected label’s place. If 
the nearest neighbor’s label is suspicious also, 
then the next nearest neighbor (still at x) is used, 
and so on. 

The repetition of the process can lead to complicated
dependencies in mislabeling rates. To reduce
that dependency, we will assume that cleaning is
batch rather than sequential. That is, there is no
ordering effect: we are not removing one sample
at a time and then cleaning the other samples with
that one sample missing. For analysis, we then
make the approximation that a sample’s neighbor
being suspicious is independent of the fact that
the original sample itself is suspicious. This allows
us to say that the probability of error when
removing a suspicious label is just $r_{\tilde{y}_\dagger}(x)$, or
$r_y(x) + \Pr[\tilde{y}_\dagger \neq y \mid x]\,(1 - 2 r_y(x))$. The mislabeling
rate can then be written as, (see Box 3).

Rewriting in terms of $r^*(x)$ and $f_\varepsilon(x)$, and then
solving for $\Pr[\tilde{y}_\dagger \neq y \mid x]$, the resulting mislabeling
rate is shown next, where $r_y(x)$ is defined in
equation 1 as $2 r^*(x)\,(1 - r^*(x))$. (See Box 4.)


Box 3. 
Box 4.



Discussion of Theoretical Analysis 

Figure 2 plots the mislabeling rate $\Pr[\tilde{y}_\dagger \neq y \mid x]$
for the three revision schemes. For the escalation
case, we have assumed $\varepsilon_h(x) = 0$. Diagonal lines
where $\Pr[\tilde{y}_\dagger \neq y \mid x] = \varepsilon$ are drawn to represent
the mislabeling rate of noisy data. Note that if
the mislabeling rate is above the diagonal lines,
then the cleaning procedure has actually degraded
data quality.

It is not surprising that escalation never degrades
data quality, since we have assumed that
escalation is done by a (human) oracle with
$\varepsilon_h(x) = 0$. On the other hand, both automatic
replacement and removal can degrade data quality, and
the degradation is a function of the Bayes’ error
rate. The higher the Bayes’ error rate, the less
effective replacement and removal are.

EXPERIMENTS 

To corroborate the theoretical findings, we performed
various experiments on the UCI machine
learning datasets. We show the results for the
OptDigit (digits “8” and “9”), Ionosphere, and
Sonar datasets here. We vary the experiments
on four dimensions. The first dimension is training
set size. We randomly sampled 80%, 40%,
and 10% of the original dataset for training.
The second dimension is the amount of labeling
noise $\varepsilon$ injected. With probability $\varepsilon$, a label in the
training data is flipped to its opposite state. The
third dimension of the experiments is the aggressiveness
of label cleaning, expressed by various
combinations of (k, k’) values: (5,5), (5,4), and
(5,3). The final dimension is the three revision
schemes described previously. For the escalation
scheme, we assume the human labeler is perfect.
Escalation is thus simulated by revealing the true
labels of all suspected data samples.
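
The noise injection and the three simulated revision schemes can be sketched as follows. This Python fragment is a hedged illustration of the protocol described above, not the authors' experimental code: the function names, the binary 0/1 label encoding, and the `neighbor_votes` argument (the class favored by a suspect's neighbors) are assumptions, and the suspects are taken as given from some identification stage.

```python
import random


def inject_label_noise(labels, eps, rng):
    """Flip each binary (0/1) label independently with probability eps."""
    return [1 - y if rng.random() < eps else y for y in labels]


def revise(samples, noisy_labels, true_labels, suspects, neighbor_votes, scheme):
    """Apply one revision scheme to the suspected samples.

    neighbor_votes maps a suspect index to the class favored by its
    neighbors; true_labels is used only to simulate a perfect labeler."""
    suspect_set = set(suspects)
    cleaned = []
    for i, (x, y) in enumerate(zip(samples, noisy_labels)):
        if i not in suspect_set:
            cleaned.append((x, y))               # untouched sample
        elif scheme == "removal":
            continue                             # drop the suspect
        elif scheme == "replacement":
            cleaned.append((x, neighbor_votes[i]))  # machine-assigned label
        elif scheme == "escalation":
            cleaned.append((x, true_labels[i]))  # perfect human labeler
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return cleaned


# Toy illustration: samples 1 and 3 were flagged; sample 3 was actually correct.
samples = [0, 1, 2, 3]
true_labels = [0, 0, 0, 1]
noisy_labels = [0, 1, 0, 1]
suspects = [1, 3]
neighbor_votes = {1: 0, 3: 0}
for scheme in ("removal", "replacement", "escalation"):
    print(scheme, revise(samples, noisy_labels, true_labels,
                         suspects, neighbor_votes, scheme))
print(inject_label_noise(true_labels, 0.25, random.Random(0)))
```

In this toy case, replacement corrupts the correct label of sample 3 and removal discards it, while escalation recovers both labels; this is the kind of error propagation from the identification stage that the experiments measure.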

Results using clean training data (before any 
noise is injected) and results for the noisy train- 
ing data (before any cleaning) are also shown for 
comparison. For each combination of training set 
size, labeling noise level, (k, k’) tuple, and revision 
scheme, 30 iterations of the experiment were run, 
and the average is reported. In all experiments, the 
nearest-neighbor classifier is used for testing. 

For our analysis, the main difference between 
the OptDigit, Ionosphere, and Sonar datasets is 
that OptDigit is the easiest dataset to learn while 
Sonar is the most difficult. Even when only 10% of 
the training data is used, OptDigit has an accuracy 
of 97%. We will note later that this ease-of-learn- 
ing does have an effect on label cleaning. 




Results 

It is not surprising that clean, noiseless data al- 
ways provide the best classification accuracy; in 
general, noisy training data without any cleaning 
provide the worst accuracy. (As we will see, this 
is not always the case.) Using the nearest-neigh- 
bor classifier without any data cleaning, random 
labeling noise has a linear relationship with clas- 
sification accuracy. While larger training datasets 
increase classifier accuracy, this gain can easily 
be taken away if just a few percent of labeling 
noise is added and not properly handled. 

Of the three revision schemes, escalation to a
perfect oracle consistently performs better than
the other two schemes, especially for datasets
with high labeling noise. This is explained by
the fact that, as the noise level $\varepsilon$ approaches 0.5,
there is less and less information in the dataset
for any identification method to find mislabeled
samples. The identifier’s assumption that anomalous
samples are mislabeled breaks down. When
$\varepsilon = 0.5$, all the automated methods do is distribute
the noise to neighboring regions. For the escalation
scheme, at least some (random) samples will be
corrected by a human labeler, and some accuracy
will therefore be gained.

The experiments show that the two automated 
cleaning schemes provide the most benefit under 
a medium amount of labeling noise; that is, the 
labeling noise is much less than 50% but far from 
nonexistent. This observation is suggested by, and 
consistent with Shanmugam and Breipohl (1971) 
and Gimlin and Ferrell (1974). The intuitive ex- 
planation for the case of high labeling noise has 
already been stated. The case of low labeling 
noise takes a bit more explaining. 

In the Ionosphere and Sonar datasets, at low 
levels of labeling noise, the removal and replace- 
ment schemes actually do worse than using the 
uncleaned data. To the authors’ knowledge, the 
fact that cleaning low levels of labeling noise 
may decrease classifier accuracy has not been 
discussed in the literature before. 2 


However, the intuition about this phenomenon 
is not difficult to develop. If the labeling noise level 
is lower than the error rate of the identification 
algorithm, then the identification stage misidenti- 
fies more suspects than truly mislabeled samples. 
In turn, automatic propagation of identification 
errors under the removal and replacement schemes 
hurts the overall cleaning procedure. Furthermore, 
the accuracy in identifying mislabeled samples 
is theoretically bounded by the Bayes’ error rate 
of the feature domain. (No automated identifica- 
tion algorithm can identify mislabeled samples 
at a higher accuracy; otherwise, one can use the 
identification algorithm as a classifier and violate 
the accuracy bound.) In practice the accuracy is 
further lowered by the finite training set size. 
Therefore, there is generally a lower limit to the 
labeling noise level below which automated 
cleaning schemes tend to inject noise rather than 
clean. Fortunately, escalation transcends this limit 
by involving a human labeler who has a higher 
accuracy than the Bayes’ error rate, because she 
can examine the data from a higher dimensional 
space (i.e., the original data sample rather than 
an abstract feature vector). 
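
The break-even argument above can be made concrete with a back-of-the-envelope sketch. It assumes a two-class problem and a hypothetical identifier summarized by two constants that do not appear in this chapter: `detect`, the probability that a truly mislabeled sample is flagged, and `false_alarm`, the probability that a correctly labeled sample is flagged. Treating both as independent of the noise level s is a simplification made only for illustration.

```python
def flagged_fractions(s, detect, false_alarm):
    """Fractions of the whole training set that are flagged.

    Returns (flagged and truly mislabeled, flagged but correctly labeled).
    `detect` and `false_alarm` are hypothetical identifier rates (assumptions).
    """
    flagged_bad = s * detect
    flagged_good = (1.0 - s) * false_alarm
    return flagged_bad, flagged_good


def noise_after_replacement(s, detect, false_alarm):
    """Mislabeling rate after flipping the label of every flagged sample (two classes)."""
    flagged_bad, flagged_good = flagged_fractions(s, detect, false_alarm)
    # Flagged mislabeled samples become correct; flagged clean samples become wrong.
    return (s - flagged_bad) + flagged_good


def break_even_noise(detect, false_alarm):
    """Noise level below which replacement injects more noise than it removes."""
    if detect + false_alarm == 0:
        raise ValueError("the identifier never flags anything")
    return false_alarm / (false_alarm + detect)


if __name__ == "__main__":
    detect, false_alarm = 0.8, 0.05  # illustrative values only
    print("break-even noise level:", break_even_noise(detect, false_alarm))
    for s in (0.02, 0.05, 0.10, 0.20, 0.40):
        after = noise_after_replacement(s, detect, false_alarm)
        verdict = "degrades" if after > s else "improves"
        print(f"s={s:.2f}: noise after replacement={after:.3f} ({verdict})")
```

Under these assumptions, replacement degrades the data exactly when the flagged clean samples outnumber the flagged mislabeled ones, that is, when s < false_alarm / (false_alarm + detect). Removal would discard the same flagged clean samples instead of corrupting them.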

One of our initial concerns was that automatic 
removal of suspicious samples may work well in 
large datasets, but for small datasets, it may be 
discarding too much information. This effect was 
not observed in the OptDigits experiments, where 
there is enough redundancy such that a large reduc- 
tion in training set size has a negligible effect on 
overall accuracy, but the degradation is definitely 
seen for the Ionosphere and Sonar domains. 

However, while replacement does not discard 
attribute information as removal does, replacement 
performs no better than removal in all our 
experiments, although the difference shrinks for 
more conservative identification schemes (e.g., 
k = 5, k’ = 5). This suggests that removal is 
more robust to identification mistakes. That is, 
accidentally replacing a good sample with the 
wrong label does far more harm than accidentally 
removing that sample. On the other hand, correcting 
a mislabeled sample is only marginally better 
than removing that sample and letting its neighbors 
“smooth” over that part of the feature space. 

The aggressiveness of identification (and 
cleaning) only magnifies the effect of the revision 
scheme. Given that one now knows in which situations 
to apply cleaning schemes, it only makes sense 
either not to clean at all (because doing 
so may actually degrade data quality) or to clean 
aggressively. 

Figure 1. Legend for interpreting the plot lines 
in Figures 2, 3, 4, and 5. The lines represent results 
from using the clean training set (no noise 
injected), using the noisy training set (no cleaning), 
and using the three revision schemes. Both 
“clean” and “noisy” results are shown as solid 
lines, although it should be obvious from context 
how to differentiate them. 

[Legend entries: clean (red), escalation, removal, replacement, noisy (blue)] 

CONCLUSION 

Our analysis shows that automatic replacement 
is always a poorer revision scheme than removal. 
However, while removal works fine under a me- 
dium mislabeling rate, it performs poorly when 
the mislabeling rate is either low or high. Removal 
can degrade data quality when mislabeling rate 
is low relative to the Bayes’ error rate. For any 
given situation, it is usually best to either clean 
aggressively or not clean at all. Only in rare 
cases does conservative cleaning improve data 
quality while aggressive cleaning degrades it. 


Figure 2. Theoretical mislabeling rates for various revision schemes given infinite data. Figure 1 shows 
the legend for matching the plot lines with revision schemes. The mislabeling rate for clean data is zero 
and is not shown. The diagonal line represents the mislabeling rate of the noisy data at $\Pr[\hat{y} \neq y \mid x] = s$. 
That is, any cleaning method that changes the mislabeling rate to below this line has improved data 
quality. Analogously, being above this line means that data quality has degraded. 

Figure 3. Accuracy of different revision schemes for the digits “8” and “9” in the UCI OptDigits dataset. 
Each column represents a different training set size. Each row represents different parameter value 
pairs for the k-k’ identification scheme. Figure 1 shows the legend for matching lines in the plots with 
revision schemes. 



Figure 4. Accuracy of different revision schemes for the UCI Ionosphere dataset. Each column represents 
a different training set size. Each row represents different parameter value pairs for the k-k’ identification 
scheme. Figure 1 shows the legend for matching lines in the plots with revision schemes. 

Figure 5. Accuracy of different revision schemes for the UCI Sonar dataset. Each column represents a 
different training set size. Each row represents different parameter value pairs for the k-k’ identification 
scheme. Figure 1 shows the legend for matching lines in the plots with revision schemes. 



REFERENCES 

Blaheta, D. (2002). Handling noisy training and 
testing data. In Proceedings of the 7th Conference 
on Empirical Methods in Natural Language Pro- 
cessing (pp. 111-116). Philadelphia, PA. 

Brighton and Mellish (2002). 

Brodley, C. E., & Friedl, M. A. (1996). Identifying 
and eliminating mislabeled training instances. In 
Proceedings of the 13th National Conference on 
Artificial Intelligence (pp. 799-805). Portland, 
OR: AAAI Press. 

Brodley, C. E., & Friedl, M. A. (1999). Identify- 
ing mislabeled training data. Journal of Artificial 
Intelligence Research, 11, 131-167. 

Castelli, V., Hutchins, S. T., Li, C.-S., & Turek, 
J. J. E. (2001). Modifying an unreliable training 
set for supervised classification. United States 
Patent 6,298,351. 

Cover, T. M., & Hart, P. E. (1967). Nearest neigh- 
bor pattern classification. IEEE Transactions on 
Information Theory, IT-13, 21-27. 

Dawid, A. P., & Skene, A. M. (1979). Maximum 
likelihood estimation of observer error-rates us- 
ing the EM algorithm. Applied Statistics, 28(1), 
20-28. 

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). 
Pattern classification (2nd ed.). John Wiley & 
Sons. 

Eskin, E. (2000). Detecting errors within a corpus 
using anomaly detection. In Proceedings of the 
1st Conference of the North American Association 
for Computational Linguistics. Seattle, WA. 



Gamberger, D., Lavrac, N., & Groselj, C. (1999). 
Experiments with noise filtering in a medical 
domain. In Proceedings of the 16th International 
Conference on Machine Learning (pp. 143-151). 
Bled, Slovenia. 

Gimlin, D. R., & Ferrell, D. R. (1974). A k-k’ 
error correcting procedure for nonparametric 
imperfectly supervised learning. IEEE Transac- 
tions on Systems, Man, and Cybernetics, SMC- 
4(3), 304-306. 

Gowda, K. C., & Krishna, G. (1979). Editing 
and error correction using the concept of mutual 
nearest neighborhood. In Proceedings of the 
International Conference on Cybernetics and 
Society (pp. 222-226). IEEE Press. 

Guyon, I., Matic, N., & Vapnik, V. (1996). Dis- 
covering informative patterns and data cleaning. 
In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, 
& R. Uthurusamy (Eds.), Advances in knowledge 
discovery and data mining (pp. 181-203). AAAI/ 
MIT Press. 

Ho, T. K., & Baird, H. S. (1997). Large-scale 
simulation studies in image pattern recognition. 
IEEE Transactions on Pattern Analysis and Ma- 
chine Intelligence, 19(10), 1067-1079. 

Lam, C. P., & Stork, D. G. (2003). Evaluating 
classifiers by means of test data with noisy labels. 
In Proceedings of the 18th International Joint 
Conference on Artificial Intelligence (pp. 513- 
518). Acapulco, Mexico. 

Ng, A. Y. (1997). Preventing “overfitting” of 
cross-validation data. In Proceedings of the 14th 
International Conference on Machine Learning 
(pp. 245-253). Nashville, TN. 


Shanmugam, K., & Breipohl, A. M. (1971). An 
error correcting procedure for learning with an 
imperfect teacher. IEEE Transactions on Systems, 
Man, and Cybernetics, SMC-1(3), 223-229. 

Smyth, P., Fayyad, U. M., Burl, M. C., & Perona, P. 
(1996). Modeling subjective uncertainty in image 
annotation. In U. M. Fayyad, G. Piatetsky-Shapiro, 
P. Smyth, & R. Uthurusamy (Eds.), Advances in 
knowledge discovery and data mining (pp. 517- 
539). AAAI/MIT Press. 

Teng, C. M. (1999). Correcting noisy data. In 
Proceedings of the 16th International Conference 
on Machine Learning (pp. 239-248). Bled, 
Slovenia. 

Wilson, D. L. (1972). Asymptotic properties of 
nearest neighbor rules using edited data. IEEE 
Transactions on Systems, Man, and Cybernetics, 
2(3), 408-421. 

Zhu, X., Wu, X., & Chen, Q. (2003). Eliminating 
class noise in large datasets. In Proceedings of 
the 20th International Conference on Machine 
Learning (pp. 920-927). Washington, DC. 

ENDNOTES 

1 Here we loosely say that the first stage is 
perfect if it correctly identifies all misla- 
beled data and correctly determines their 
true labels. For two-class problems, the two 
tasks are equivalent. 

2 One of the experiments in Aha et al. (1991) 
has shown that the IB3 algorithm can de- 
crease accuracy given a noiseless training 
dataset. The effect was simply noted and 
not fully explained. 
ABSTRACT 

Every time a user links up to a Web site, the server keeps track of all the transactions accomplished 
in a log file. What is captured is the “click flow” (clickstream) of the mouse and the keys used by the 
user during the navigation inside the site. Usually every click of the mouse corresponds to the view- 
ing of a Web page. The objective of this chapter is to show how Web clickstream data can be used to 
understand the most likely paths of navigation in a Web site, with the aim of predicting, possibly online, 
which pages will be seen, having seen a specific path of other pages before. Such analysis can be very 
useful to understand, for instance, what is the probability of seeing a page of interest (such as the buy- 
ing page in an e-commerce site) coming from another page. Or, what is the probability of entering (or 
exiting) the Web site from any particular page. From a methodological viewpoint, we present two main 
research contributions. On one hand, we show how to improve the efficiency of the Apriori algorithm; 
on the other hand, we show how Markov chain models can be usefully developed and implemented 
for Web usage mining. In both cases, we compare the results obtained with classical association rules 
algorithms and models. 


INTRODUCTION 

In the last few years, the number of people who 
use the Internet has increased enormously. Companies 
promote and sell their products on the Web, 
institutions provide information about their services, 
and individuals use personal Web pages to 
introduce themselves to the whole Internet community. 

We will show how information concerning 
the order in which the pages of a Web site are 
visited can be profitably used to predict visit 
behaviour at the site itself. 

Every time a user links up to a Web site, the 
server keeps track of all the actions accomplished 
in a log file. What is captured is the “click flow” 
(clickstream) of the mouse and the keys used by 
the user during the navigation inside the site. Usu- 
ally every click of the mouse corresponds to the 
viewing of a Web page. Therefore, we can define 
the click-stream as the sequence of the Web pages 
requested. The succession of the pages shown by 
a single user during navigation inside the Web 
identifies a user session. Typically, the analysis 
only concentrates on the part of each user session 
concerning access to a specific site. The set 
of pages seen within a user session that belong 
to a given site is known as a 
server session. 

All this information can be profitably used to 
design a Web site efficiently. A Web page is well 
designed if it is able to attract users and direct 
them easily to other pages within the site. A very 
important area in Web mining is the application of 
data mining techniques to discover usage patterns 
from Web data in order to optimally design a Web 
site and to better satisfy the needs of different visitors. 
This problem is known as Web usage mining, in 
contrast to Web content mining (analysis of the 
content of Web sites) and Web structure mining 
(analysis of Internet links); for more details, 
see, for instance, Baldi, Frasconi, and 
Smyth (2003) or Chakrabarti (2003). 


The objective of our analysis is to use Web 
clickstream data to understand the most likely 
paths of navigation in a Web site, with the aim of 
predicting, possibly online, which pages will be 
seen, having seen a specific path of other pages 
before. Such analysis can be very useful to un- 
derstand, for instance, what is the probability of 
seeing a page of interest (such as the buying page 
in an e-commerce site) coming from another page, 
or the probability of entering (or exiting) 
the Web site from any particular page. 

The most frequent type of statistical analysis 
of Web clickstream data is the search for the 
most interesting association and sequence rules 
(see, for an introduction, Han & Kamber, 2000 
or Hand, Mannila, & Smyth, 2001); this search is 
accomplished by means of the well-known Apriori 
algorithm (Agrawal, Mannila, Srikant, Toivonen, 
& Verkamo, 1995). Our research proposal is twofold: 
it improves both the statistical analysis, by 
considering different Markov chain models, and 
the computational search algorithm, by considering 
a GenMax-type proposal. 

According to what typically is done in the 
data-mining literature and practice, we shall 
compare our proposal with standard approaches 
by means of a real case study. In the description 
of the analysis, we shall follow the steps of the 
data mining process as described, for instance, in 
Berry and Linoff (1997), Giudici (2003) or Hastie 
et al. (Hastie, Tibshirani, & Friedman, 2001). 

The database we use to illustrate 
our methodology results from processing 
the log file of an e-commerce site, 
described, for instance, in Giudici (2003). The 
whole data set contains 250,711 observations, 
each corresponding to a click, that describe the 
navigation paths of 22,527 visitors among the 36 
pages that compose the site of the Webshop. For 
illustrative purposes, Table 1 reports a very small 
extract of the available dataset. 

Table 1. Extract of the considered dataset 

c_value           c_time              c_caller 
70ee683a6df...    14OCT97:11:09:01    Home 
70ee683a6df...    14OCT97:11:09:08    Catalog 
70ee683a6df...    14OCT97:11:09:14    Program 
70ee683a6df...    14OCT97:11:09:23    Product 
70ee683a6df...    14OCT97:11:09:24    Program 

Table 1 describes the user session of one Web 
visitor, indexed by the c_value 70ee683a6df... 
Specifically, the column c_caller lists the 
pages clicked by the visitor, at the times given 
by c_time. 
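
As an illustration of how such records can be turned into ordered page sequences, the following minimal sketch groups clicks by c_value and sorts them by c_time. It is not the authors' SAS code: the function names are hypothetical, the timestamp layout (day, month abbreviation, two-digit year, then time) is read off Table 1, and parsing the month abbreviation assumes an English locale.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from Table 1; the c_value is truncated in the source and kept as printed.
CLICKS = [
    ("70ee683a6df...", "14OCT97:11:09:01", "Home"),
    ("70ee683a6df...", "14OCT97:11:09:08", "Catalog"),
    ("70ee683a6df...", "14OCT97:11:09:14", "Program"),
    ("70ee683a6df...", "14OCT97:11:09:23", "Product"),
    ("70ee683a6df...", "14OCT97:11:09:24", "Program"),
]


def parse_click_time(c_time):
    """Parse a timestamp such as '14OCT97:11:09:01' (assumed layout)."""
    return datetime.strptime(c_time, "%d%b%y:%H:%M:%S")


def build_sessions(clicks):
    """Map each c_value to its list of pages, ordered by click time."""
    by_visitor = defaultdict(list)
    for c_value, c_time, c_caller in clicks:
        by_visitor[c_value].append((parse_click_time(c_time), c_caller))
    return {
        c_value: [page for _, page in sorted(timed_pages)]
        for c_value, timed_pages in by_visitor.items()
    }


sessions = build_sessions(CLICKS)
print(sessions)
```

Each value in `sessions` is the ordered list of pages for one visitor, which is the form of input assumed by the sequence rule and Markov chain sketches later in this chapter.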

In order to model the previous data, we have 
considered two main classes of statistical models: 
sequence rules and Markov Chains. In order to 
compare fairly the two approaches, we have used 
the same statistical tool, the SAS software. In the 
case of Markov chains, we have programmed part 
of the code using the IML language of SAS. 

Furthermore, in the case of sequence rules, 
we have compared the results obtained with the 
Apriori algorithm implemented in SAS with a 
recent proposal, the GenMax algorithm (Zaki & 
Hsiao, 2002), that we have implemented with the 
IML language. 

The structure of the chapter is as follows: in 
Section 2 we briefly recall what sequence rules 
are, in the context of association rules; in Section 
3 we present algorithms to efficiently find such 
rules, and, specifically, we compare the Apriori 
algorithm with a GenMax algorithm; in Section 
4 we introduce Markov chains for Web mining; 
and, finally, in the last section, we present the 
experimental results concerning the comparison 
between classical sequence rules and Markov 
chains. 

SEQUENCE RULES 

We now recall what a sequence rule is. For more 
details, the reader can consult a recent text on data 
mining, such as Han and Kamber (2001) or, from 
a more statistical viewpoint, Hand et al. (2001), 
Hastie et al. (2001) and Giudici (2003). 

An association rule is a statement between 
two sets of binary variables (itemsets), A and 
B, that can be written in the form A → B, to be 
interpreted as a logical statement: if A, then B. If 
the rule is ordered in time, we have a sequence 
rule and, in this case, A precedes B. 

In Web clickstream analysis, a sequence rule 
is typically indirect: namely, between the visit of 
page A and the visit of page B other pages can 
be seen. On the other hand, in a direct sequence 
rule, A and B are seen consecutively. 

A sequence rule model is, essentially, an 
algorithm that searches for the most interesting 
rules in a database. In order to find a set of rules, 
statistical measures of “interestingness” have 
to be specified. The measures most commonly 
used in Web mining to evaluate the importance 
of a sequence rule are the indexes of support, 
confidence, and the lift. 

In this chapter, we shall consider mainly the 
confidence index. The confidence for the rule A → B 
is obtained by dividing the number of server sessions 
that satisfy the rule (the so-called “support” 
of the rule) by the number of sessions containing 
the page A (the support of the page). 

In other words, the confidence approximates 
the conditional probability that, in a server session 
in which page A has been seen, page B is subsequently 
requested (see, e.g., Giudici, 2003). 
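
The definition of confidence can be sketched in a few lines. The example below is a minimal illustration, not the SAS procedure used in the chapter; the toy sessions and all function names are assumptions. It covers both indirect rules (B seen at any later click) and direct rules (B seen immediately after A).

```python
def satisfies_indirect(session, a, b):
    """True if page a is seen and page b is requested at some later click."""
    if a not in session:
        return False
    first_a = session.index(a)
    return b in session[first_a + 1:]


def satisfies_direct(session, a, b):
    """True if page b is requested immediately after page a."""
    return any(x == a and y == b for x, y in zip(session, session[1:]))


def confidence(sessions, a, b, direct=False):
    """Support of the rule a -> b divided by the support of page a."""
    satisfies = satisfies_direct if direct else satisfies_indirect
    support_a = sum(1 for s in sessions if a in s)
    support_rule = sum(1 for s in sessions if satisfies(s, a, b))
    return support_rule / support_a if support_a else 0.0


# Toy server sessions (hypothetical, for illustration only).
SESSIONS = [
    ["Home", "Catalog", "Program", "Product", "Program"],
    ["Home", "Product"],
    ["Catalog", "Home"],
    ["Home", "Catalog"],
]

indirect_conf = confidence(SESSIONS, "Home", "Product")
direct_conf = confidence(SESSIONS, "Home", "Product", direct=True)
print(indirect_conf, direct_conf)
```

In the toy data, Home appears in all four sessions, Product follows it at some later click in two of them, and immediately in only one, so the indirect confidence is higher than the direct one.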

ALGORITHMS TO FIND SEQUENCE 
RULES 

In this section, we shall compare algorithms to 
extract association and, therefore, sequence rules, 
from a transactional database. The algorithms 
that we shall compare will be the Apriori, the 
backtracking, and the GenMax algorithms. 

Let us define a k-itemset as a subset of order k 
of the variables being analysed; in our case, the 
pages of the Web site under consideration. The 
above algorithms aim to find the most frequent 
itemsets, possibly of any order. Operationally, 
“most frequent” means all itemsets whose support 
passes a fixed threshold. 

The Apriori algorithm is the simplest of the 
three algorithms. It is level-wise: it obtains all 
frequent itemsets of a level (order) before moving 
on to the following level. It consists of a series 
of join operations; at each step, it joins the 
frequent k-itemsets found in the previous step, 
thus generating (k+1)-itemsets. It then eliminates 
the (k+1)-itemsets that contain subsets that are not 
frequent, according to the following property: 

Apriori property: All subsets of a frequent k- 
itemset must be frequent. 

Table 2. Illustrative transactional database 

Transaction ID    Items 
1                 A C T W 
2                 C D W 
3                 A C T W 
4                 A C D W 
5                 A C D T W 
6                 C D T 

To better illustrate the algorithm, consider the 
example described in Table 2. 

Based on the data in Table 2, the search strategy 
of the Apriori algorithm is described in Figure 
1, for a threshold support level equal to three out 
of six (50%). 
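
The level-wise search can be sketched as follows on the data of Table 2, with a support threshold of three transactions. This is a minimal illustrative implementation of the join and prune steps described above, not the SAS implementation used in the chapter; function and variable names are assumptions.

```python
from itertools import combinations

# Transactions of Table 2 (transaction ID -> items).
TRANSACTIONS = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= set(items))


def apriori(transactions, min_support):
    """Return {frequent itemset: support}, found level by level."""
    items = sorted({item for t in transactions.values() for item in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, transactions)
        previous = set(level)
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in previous
                             for sub in combinations(c, k))}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
        k += 1
    return frequent


frequent = apriori(TRANSACTIONS, min_support=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

The support of every surviving candidate is recomputed by scanning all transactions, which is the cost that grows quickly as the number of pages increases.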

From Figure 1 it is clear that the algorithm 
is level-wise. The dashed blue lines separate 
the different levels; the itemsets in green are 
those that do not pass the threshold level, and 
the itemsets in red are those eliminated by the 
Apriori property. 

The weakness of the Apriori algorithm is 
that, as the number of frequent itemsets increases 
(e.g., when the number of Web pages is large), the 
algorithm requires more storage and more 
computation to apply the Apriori property. 

Figure 1. The search space of the Apriori algorithm 

A first solution to this problem is the backtracking 
algorithm. The backtracking algorithm 
employs a depth-first technique, where the search 
does not proceed by levels, as in the Apriori 
algorithm, but rather by branches, generating a 
tree-like structure. 

Operationally, each k-itemset found is 
associated with a combine set, that is, the set of 
pages that, added to the k-itemset, generate (k+1)- 
itemsets whose support is above the established 
threshold (for brevity, in the following we shall 
say that they are frequent). 

Figure 2 illustrates the search strategy of the 
backtracking algorithm, for the example illus- 
trated in Table 2. The combine sets are shown in 
squared brackets. 

In Figure 2, the arrows show the direction 
of the search, which is clearly branch-wise rather 
than level-wise as in Figure 1. Notice that the 
coloured itemsets will not be generated by the 
search strategy, as no combine set will lead to 
them. We have left them in the figure to empha- 
size the difference between the Apriori and the 
backtracking algorithm. 

In other words, the backtracking algorithm 
exploits the Apriori property without actually 
computing it. This alleviates the computational 
burden of the Apriori algorithm, especially when 
the number of Web pages becomes large. 
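
A depth-first search driven by combine sets can be sketched as follows, again on the data of Table 2 with a threshold of three. This is an illustrative reading of the description above, with hypothetical function names; pages are explored in alphabetical order, which may differ from the ordering used in Figure 2.

```python
# Transactions of Table 2 (transaction ID -> items).
TRANSACTIONS = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
MIN_SUPPORT = 3


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for items in TRANSACTIONS.values() if wanted <= set(items))


def combine_set(itemset, candidates):
    """Pages that, added to the itemset, still give a frequent itemset."""
    return [page for page in candidates
            if support(itemset + [page]) >= MIN_SUPPORT]


def backtrack(itemset, combine, found):
    """Depth-first enumeration: extend the itemset along one branch at a time."""
    for position, page in enumerate(combine):
        extended = itemset + [page]
        found.append("".join(extended))
        # Only pages after the current one in the combine set are tried,
        # so no itemset is generated twice.
        backtrack(extended, combine_set(extended, combine[position + 1:]), found)
    return found


pages = sorted({item for t in TRANSACTIONS.values() for item in t})
frequent = backtrack([], combine_set([], pages), [])
print(frequent)
```

Because an infrequent extension never enters a combine set, none of its supersets is ever generated, so no explicit subset check is needed.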

However, for low values of the support 
threshold, the search space of the backtracking 
algorithm remains large. To 
improve the efficiency of the algorithm in this 
case, a number of research papers suggest using 
maximal frequent itemsets (MFIs). Maximal frequent 
itemsets are frequent k-itemsets for which no 
superset is frequent; in other words, the combine 
set of an MFI is empty. 

Once all MFIs are extracted, all frequent itemsets 
can be obtained as their subsets since, according to 
the Apriori property, all subsets of an MFI are 
frequent. The GenMax algorithm that we propose 
in this chapter employs the properties of certain 
sets, named tidsets, to extract MFIs. 
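
To make the role of MFIs concrete, the sketch below finds the maximal frequent itemsets of Table 2 (threshold three) by brute force and then recovers every frequent itemset as a subset of an MFI. The brute-force enumeration is deliberately naive and is not the GenMax search; it is used only to check the definition on a small example, and all names are assumptions.

```python
from itertools import combinations

# Transactions of Table 2 (transaction ID -> items).
TRANSACTIONS = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for items in TRANSACTIONS.values() if wanted <= set(items))


def frequent_itemsets(min_support):
    """All frequent itemsets, by exhaustive enumeration (small examples only)."""
    items = sorted({item for t in TRANSACTIONS.values() for item in t})
    return [frozenset(combo)
            for k in range(1, len(items) + 1)
            for combo in combinations(items, k)
            if support(combo) >= min_support]


def maximal(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return [x for x in frequent if not any(x < y for y in frequent)]


def expand(mfis):
    """All non-empty subsets of the given MFIs."""
    return {frozenset(sub)
            for mfi in mfis
            for k in range(1, len(mfi) + 1)
            for sub in combinations(sorted(mfi), k)}


frequent = frequent_itemsets(min_support=3)
mfis = maximal(frequent)
print(sorted("".join(sorted(m)) for m in mfis))
```

For this example, two MFIs are enough to summarize all frequent itemsets, which is why searching for MFIs directly can shrink the search space.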

In our problem, the tidset of a k-itemset is the set of all 
users (transactions) whose sessions contain that k-itemset. For the data 
described in Table 2, and k = 1, the tidsets are 
those in Table 3. 

Figure 2. The search space of the backtracking algorithm 

Table 3. Tidsets of all 1-itemsets 

Itemset    Tidset 
A          1, 3, 4, 5 
C          1, 2, 3, 4, 5, 6 
T          1, 3, 5, 6 
W          1, 2, 3, 4, 5 
D          2, 4, 5, 6 
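
The tidsets of Table 3 can be derived directly from Table 2, and the tidset of a larger itemset is the intersection of the tidsets of its items, so its support is simply the size of that intersection. The following sketch illustrates this; the function names are hypothetical.

```python
# Transactions of Table 2 (transaction ID -> items).
TRANSACTIONS = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def one_item_tidsets(transactions):
    """Map each item to the set of transaction IDs that contain it (Table 3)."""
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def tidset_of(itemset, tidsets):
    """Tidset of a non-empty itemset: intersection of its items' tidsets."""
    return set.intersection(*(tidsets[item] for item in itemset))


tidsets = one_item_tidsets(TRANSACTIONS)
for item in "ACTWD":
    print(item, sorted(tidsets[item]))

# Support of the 2-itemset {A, T} without rescanning the transactions.
print("AT", sorted(tidset_of("AT", tidsets)))
```

Working with tidsets replaces repeated scans of the database with set intersections.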


The GenMax algorithm is built on a theorem 
introduced by Zaki and Hsiao (2002), which follows. 
Let $I_l \cup \{c_i\}$ indicate the union between an 
$l$-itemset $I_l$ and the $i$-th element $c_i$ of its combine set, 
and let $c(\cdot)$ indicate the closure operator, that is, 
the largest itemset with tidset equal to that under 
consideration. 

Let $X_i = I_l \cup \{c_i\}$ and $X_j = I_l \cup \{c_j\}$ be two members of a 
set of itemsets of level $l+1$, and let $[X_i, t(X_i)]$ and 
$[X_j, t(X_j)]$ be the corresponding pairs of itemset and 
associated tidset. The following hold: 

1. if $t(X_i) = t(X_j)$, then $c(X_i) = c(X_j) = c(X_i \cup X_j)$ 

2. if $t(X_i) \subset t(X_j)$, then $c(X_i) \neq c(X_j)$, but $c(X_i) = c(X_i \cup X_j)$ 

3. if $t(X_i) \supset t(X_j)$, then $c(X_i) \neq c(X_j)$, but $c(X_j) = c(X_i \cup X_j)$ 

4. if $t(X_i) \neq t(X_j)$ and neither tidset contains the other, then $c(X_i) \neq c(X_j) \neq c(X_i \cup X_j)$ 
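
The four cases can be checked numerically on Table 2. In the sketch below the closure of an itemset is computed as the set of items common to all transactions in its tidset; this is a small verification aid with hypothetical function names, not part of the GenMax implementation.

```python
# Transactions of Table 2 (transaction ID -> items).
TRANSACTIONS = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}


def tidset(itemset):
    """IDs of the transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return frozenset(tid for tid, items in TRANSACTIONS.items()
                     if wanted <= set(items))


def closure(itemset):
    """Largest itemset sharing the tidset of the given itemset."""
    tids = tidset(itemset)
    if not tids:
        raise ValueError("closure is only computed for itemsets that occur")
    return frozenset(set.intersection(*(set(TRANSACTIONS[t]) for t in tids)))


def which_case(x_i, x_j):
    """Return which of the four cases of the theorem applies (1 to 4)."""
    t_i, t_j = tidset(x_i), tidset(x_j)
    if t_i == t_j:
        return 1
    if t_i < t_j:
        return 2
    if t_i > t_j:
        return 3
    return 4


def union(x_i, x_j):
    return "".join(sorted(set(x_i) | set(x_j)))


for x_i, x_j in [("AC", "AW"), ("A", "W"), ("W", "A"), ("A", "D")]:
    case = which_case(x_i, x_j)
    closures = [sorted(closure(x)) for x in (x_i, x_j, union(x_i, x_j))]
    print(x_i, x_j, "case", case, closures)
```

In cases 1 to 3 the closure of the union coincides with a closure already known, which is the fact used to replace or skip branches of the search tree.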

The previous properties can be used to generate 
an efficient search space. Figure 3 describes the 
search space of the GenMax algorithm. 

Figure 3 contains different colours to empha- 
size differences with respect to the Apriori and 
the backtracking algorithm. While the branches 
of the backtrack algorithm are in black, the 
branches of the GenMax algorithm are green, if 
search branches, or red, if obtained as updated 
frequent itemsets from the previous theorem. All 
MFIs are in yellow. 

By eliminating redundancies in the application 
of the comparisons requested by the theorem, as 
well as the branches that generate false MFIs, 
that is, subsets of others, the search space can be 
further simplified as in Figure 4. 


Figure 3. The search space of the GenMax algorithm 

Figure 4. The search space of the modified GenMax algorithm 

Figure 5. Comparison of computational efficiencies of the backtracking and GenMax algorithms 

Comparing Figure 4 with Figure 3 and Figure 
2, the computational advantages appear evident. 
In order to check the efficiency gain empirically, 
we have conducted a simulation study, whose results 
are summarized in Figure 5. 

Figure 5 shows the computational times needed 
to obtain the MFIs using, respectively, the backtracking 
algorithm (blue colour) and the GenMax 
algorithm in the two versions that we described 
before: the one based on the theorem (red colour) and 
the version modified by us (yellow colour). 

From Figure 5, the advantage of the modified 
GenMax algorithm is clear, and is comparatively 
higher as the support threshold decreases. This 
shows that the GenMax algorithm, especially in 
the version modified by us, is an efficient search 
algorithm that can overcome the difficulties of both 
the Apriori and the backtracking algorithms. 

MARKOV CHAINS FOR WEB 
USAGE MINING 

We now consider the statistical analysis of Web 
clickstream data. The standard practice is to 
employ sequence rules and graphs derived from 
them. This type of analysis is local, as the sta- 
tistical measures of interestingness considered 
(support, confidence, lift) are calculated for the 
itemsets at hand, that is marginally. This implies 
that such measures are typically not normalised 
(for a more detailed discussion of this see, for 
example, Giudici, 2003). 

Our proposal to overcome this problem is to 
introduce a global, that is, multivariate model that 
can also lead, as a byproduct, to measures similar 
to those employed to evaluate interestingness of 
association and sequence rules. The model that 
we propose here is a discrete Markov chain. 

The idea of Markov chains is to introduce dependence 
between time-specific variables. In each 
session, to each time point i, here corresponding 
to the i-th click, there corresponds a discrete random 
variable, with as many categories as the number 
of pages (these are named states of the chain). The 
observed i-th page in the session is the observed 
realisation of the Markov chain, at time i, for 
that session. Time can go from i = 1 to i = T, and 
T can be any finite number. Note that a session 
can stop well before T: in this case the last page 
seen is said to be an absorbing state (end_session 
for our data). 

A Markov chain model establishes a probabilistic 
dependence between what was seen before 
time i and what will be seen at time i. 

In particular, a first-order Markov chain establishes 
that what is seen at time i depends only 
on what was seen at time i-1. This short-memory 
dependence can be described by a transition matrix 
that gives the probability of going 
from any page to any other in one step only. For 
36 pages, there are 36 × 36 probabilities of this 
kind. 

The conditional probabilities in the transi- 
tion matrix can be estimated on the basis of the 
available conditional frequencies. If we add the 
assumption that the transition matrix is constant 
in time (homogeneity of the Markov chain), we 
can use the frequencies of any two adjacent pairs 
of time-ordered clicks to estimate the conditional 
probabilities. 
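
Under the homogeneity assumption, the estimate reduces to counting adjacent pairs of clicks, pooled over all positions and sessions, and normalizing each row. The sketch below illustrates this on toy sessions; it is not the SAS/IML code used in the chapter, the sessions are hypothetical, and appending an explicit end_session state to every session is an assumption about how the absorbing state is encoded.

```python
from collections import Counter, defaultdict

END = "end_session"  # absorbing state, appended to every session (assumption)


def estimate_transition_matrix(sessions):
    """Estimate P(next page | current page) from adjacent click pairs."""
    counts = defaultdict(Counter)
    for pages in sessions:
        path = list(pages) + [END]
        for current_page, next_page in zip(path, path[1:]):
            counts[current_page][next_page] += 1
    matrix = {}
    for current_page, row in counts.items():
        total = sum(row.values())
        matrix[current_page] = {nxt: n / total for nxt, n in row.items()}
    return matrix


# Toy server sessions (hypothetical, for illustration only).
SESSIONS = [
    ["Home", "Catalog", "Program", "Product", "Program"],
    ["Home", "Product"],
    ["Home", "Catalog"],
]

P = estimate_transition_matrix(SESSIONS)
for page, row in P.items():
    print(page, row)
```

Each row of `P` sums to one, and only transitions that were actually observed receive a non-zero estimate.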

Note the analogy of Markov chains with direct 
sequences. Conditional probabilities in a first-or- 
der Markov model correspond to the confidence 
of order two direct sequence rules and, therefore, 
a first order Markov chain is a model for direct 
sequences of order two. The difference is that the 
Markov chain model is a global and not a local 
model. This is mainly reflected in the fact that 
Markov chains consider all pages and not only 
those with a high support. Furthermore, the Mar- 
kov model is a probabilistic model and, as such, 
allows inferential results to be obtained. 

First-order Markov models have been shown to 
have lower precision than other Markov models. 
The most obvious generalizations of first-order 
Markov models are models of the second and of 
the third order. It can be shown that a second-order 
Markov model is a model for direct sequences 
of order three, a third-order Markov model is a 
model for sequences of order four, and so on. In 
general, a k-th order model is described by the 
following property: 

$P(\mathrm{Pag}_{n+1} \mid \mathrm{Pag}_n, \mathrm{Pag}_{n-1}, \ldots, \mathrm{Pag}_0) = P(\mathrm{Pag}_{n+1} \mid \mathrm{Pag}_n, \ldots, \mathrm{Pag}_{n-k+1})$ 

Such a model demands, therefore, the calculation 
of all $M^k$ possible combinations of the states 
(with M the number of pages), 
increasing the dimension of the transition matrix 
space. This is because the states of these models are 
not just the pages (as in first-order models), but 
all their possible combinations. This increase in 
the number of states can limit the use of Markov 
models for applications in which speed is essential 
or storage memory is limited. 

Table 4. Number of MFI associated with each cluster 

Cluster      Nr. MFI 
Cluster 1    156 
Cluster 2    363 
Cluster 3    38 
Cluster 4    826 
Cluster 5    381 
Cluster 6    65 

A further problem is that a higher-order Markov 
chain requires more data points for an efficient 
estimation; it becomes likely, for a high-order 
chain, that no data is present for a specific state. 
A simple way to overcome this problem is to use 
several Markov models at a time. For each state, 
if the highest-order Markov model contains the request, 
it is used for prediction; otherwise, the request is passed 
to the next lower-order model, and so on. This method is 
called the All-Kth-Order Markov model (Pitkow & 
Pirolli, 1999). 
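
The fallback logic can be sketched as follows. This is a minimal illustration of the idea of trying the highest-order model first and backing off to lower orders; it is not the implementation evaluated in this chapter, and the toy sessions, function names, and the choice of predicting the most frequent next page are assumptions.

```python
from collections import Counter, defaultdict


def fit_all_kth_order(sessions, max_order):
    """Count next-page frequencies for every context of length 1..max_order."""
    models = {k: defaultdict(Counter) for k in range(1, max_order + 1)}
    for pages in sessions:
        for k in range(1, max_order + 1):
            for i in range(k, len(pages)):
                context = tuple(pages[i - k:i])
                models[k][context][pages[i]] += 1
    return models


def predict_next(models, history):
    """Use the highest-order model that has seen the current context.

    Returns (predicted page, order used), or (None, 0) if no model applies.
    """
    for k in sorted(models, reverse=True):
        if len(history) < k:
            continue
        context = tuple(history[-k:])
        if context in models[k]:
            return models[k][context].most_common(1)[0][0], k
    return None, 0


# Toy server sessions (hypothetical, for illustration only).
SESSIONS = [
    ["Home", "Catalog", "Program"],
    ["Home", "Catalog", "Product"],
    ["Catalog", "Home", "Catalog", "Program"],
]

models = fit_all_kth_order(SESSIONS, max_order=3)
print(predict_next(models, ["Home", "Catalog"]))     # second-order context is known
print(predict_next(models, ["Product", "Catalog"]))  # falls back to first order
print(predict_next(models, ["Unknown"]))             # no model can predict
```

The number of stored contexts grows with the order, which is the storage cost mentioned above; the fallback only removes the problem of missing contexts, not the cost of keeping several models.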

EXPERIMENTAL RESULTS AND 
COMPARISONS 

We shall now compare the predictive accuracy of 
sequence rules with that of Markov chain models 
of different orders, using those described in sec- 
tion 1. We have split the database in a training and 
validation set, as is done in the cross-validation 
comparison approach. 

Furthermore, in order to better evaluate 
possible effects of different data structures on 


Table 5. Comparison of predictive performances between different models

            Markov           Markov           Markov           Sequence         Sequence
            First Order      Second Order     Third Order      Order two        Order three
Cluster 1   58,9 % (0,00 %)  63,8 % (0,72 %)  67,9 % (2,52 %)  51,9 % (0,00 %)  54,1 % (1,59 %)
Cluster 2   73,0 % (0,00 %)  70,7 % (0,10 %)  73,7 % (0,34 %)  43,8 % (0,00 %)  39,8 % (0,84 %)
Cluster 3   75,8 % (0,00 %)  86,2 % (0,22 %)  88,7 % (0,95 %)  71,5 % (0,00 %)  88,8 % (4,72 %)
Cluster 4   66,7 % (0,00 %)  63,9 % (0,17 %)  67,6 % (0,53 %)  42,9 % (0,00 %)  39,1 % (0,36 %)
Cluster 5   61,2 % (0,00 %)  67,9 % (0,34 %)  69,2 % (1,00 %)  37,4 % (0,00 %)  41,1 % (0,99 %)
Cluster 6   70,9 % (0,00 %)  73,5 % (0,02 %)  73,4 % (0,42 %)  52,4 % (0,00 %)  50,9 % (1,46 %)
prediction accuracy, we have run a preliminary 
cluster analysis of the data, obtaining six different 
clusters of navigation behaviours. The number 
of MFIs associated with each cluster is shown 
in Table 4. 

From Table 4, note the rather different structure 
of the clusters. For example, while cluster 4 has 
826 MFIs, cluster 3 has only 38. 

The predictive performance of Markov chains 
and sequence models is compared, for each clus- 
ter, in Table 5. 

In Table 5, we report two figures for each 
combination of models and clusters. The first 
figure is the prediction accuracy; the second figure 
(in parentheses) is the percentage of predictions 
that cannot be made because of absence of data 
on which to base the estimates. The considered 
models are first, second, and third order Markov 
models as well as sequence rules models of order 
two and three. 

We can see that Markov chains perform 
generally better. The difference between the 
performance of Markov models and sequence 
rules is, however, not constant across clusters. 


It can be shown that this relative difference is 
proportional to the number of maximal frequent 
itemsets: the lower such number, the greater the 
gain in accuracy, as can be deduced comparing 
Tables 4 and 5. 

We finally compare the statistical efficiency 
(that is, predictive performance) of the All k-th 
order Markov model. Table 6 contains the results 
of such comparison. 

From Table 6, it appears that the gain of the all 
k-th order Markov model is negligible, especially 
with respect to the third-order model. We believe 
that the obtained gain does not compensate for 
the increased complexity and extra computational 
cost that the model bears. 
Table 6. Comparison of predictive performances between All k-th order Markov model and low order models

            Markov           Markov           Markov           All-Kth-Order
            1° Ord.          2° Ord.          3° Ord.          Markov model
Cluster 1   58,9 % (0,00 %)  63,8 % (0,72 %)  67,9 % (2,52 %)  68,3 % (1,50 %)
Cluster 2   73,0 % (0,00 %)  70,7 % (0,10 %)  73,7 % (0,34 %)  73,5 % (0,21 %)
Cluster 3   75,8 % (0,00 %)  86,2 % (0,22 %)  88,7 % (0,95 %)  88,5 % (0,51 %)
Cluster 4   66,7 % (0,00 %)  63,9 % (0,17 %)  67,6 % (0,53 %)  67,6 % (0,34 %)
Cluster 5   61,2 % (0,00 %)  67,9 % (0,34 %)  69,2 % (1,00 %)  69,2 % (0,62 %)
Cluster 6   70,9 % (0,00 %)  73,5 % (0,02 %)  73,4 % (0,42 %)  73,5 % (0,03 %)
ABSTRACT 

Probabilistic principal surfaces (PPS) is a nonlinear latent variable model with very powerful visualization and classification capabilities that seem able to overcome most of the shortcomings of other neural tools. PPS builds a probability density function of a given set of patterns lying in a high-dimensional space, expressed in terms of a fixed number of latent variables lying in a Q-dimensional latent space. Usually, the Q-space is either two- or three-dimensional, so the density function can be used to visualize the data within it. The case Q = 3 allows the patterns to be projected onto a spherical manifold, which turns out to be optimal when dealing with sparse data. PPS may also be arranged in ensembles to tackle complex classification tasks. As template cases, we discuss the application of PPS to two real-world data sets from astronomy and genetics. 
INTRODUCTION 

The explosive growth in the quantity, quality, and accessibility of data that is currently experienced in all fields of science and human endeavor has triggered the search for a new generation of computational theories and tools, collectively constituting the field of data mining, capable of assisting humans in extracting useful information (knowledge) from huge amounts of distributed and heterogeneous data. This revolution has two main aspects. On the one hand, in astronomy, as well as in high energy physics, genetics, social sciences, and many other fields, traditional interactive data analysis and data visualization methods have proved to be far from adequate to cope with data sets that are characterized by huge volumes and/or complexity (tens or hundreds of parameters or features per record; cf. Abello, Pardalos, & Resende, 2002, and references therein). On the other hand, the simultaneous analysis of hundreds of parameters may unveil previously unknown patterns that will lead to a deeper understanding of the underlying phenomena and trends. 

Knowledge discovery in databases (KDD) is therefore becoming of paramount importance not only in its traditional arena, but also as an auxiliary tool for almost all fields of research. In this chapter, after a short introduction to latent variable models, we shall first focus on the visualization and classification capabilities of spherical probabilistic principal surfaces and then on the possibility of building PPS ensembles. Finally, we shall discuss two applications in the fields of astronomy and genetics. All results have been obtained in the framework of the Astroneural collaboration: a joint project between the Department of Mathematics and Informatics of the University of Salerno and the Department of Physical Sciences of the University Federico II of Napoli. The main goal of the collaboration is to implement a user-friendly data-mining tool capable of dealing with heterogeneous, high-dimensionality data sets. All software is implemented under the Matlab computing environment exploiting the LANS Pattern Recognition Matlab Toolbox (http://www.lans.ece.utexas.edu/~lans/lans/) and the Netlab Toolbox (Nabney, 2002). 

LATENT VARIABLE MODELS 

The goal of a latent variable model is to express the distribution p(t) of the variable t = (t_1, ..., t_D) in terms of a smaller number of latent variables x = (x_1, ..., x_Q), where Q < D. To achieve this, the joint distribution p(t,x) is decomposed into the product of the marginal distribution p(x) of the latent variables and the conditional distribution p(t|x) of the data variables, given the latent variables (Bishop, 1999). Expressing the conditional distribution as a factorization over the data variables, the joint distribution becomes: 

$$p(\mathbf{t},\mathbf{x}) = p(\mathbf{x})\,p(\mathbf{t}\mid\mathbf{x}) = p(\mathbf{x})\prod_{d=1}^{D} p(t_d\mid\mathbf{x}). \qquad (1)$$

The conditional distribution p(t|x) is then writ- 
ten in terms of a mapping from latent variables 
to data variables, so that t=y(x;w)+u. y(x;w) is a 
function of the latent variable x with parameters 
w, and u is an x-independent noise process. If the 
components of u are uncorrelated, the conditional 
distribution for t will factorize as in (1). Geometri- 
cally, the function y(x; w) defines a manifold in data 
space given by the image of the latent space. The 
definition of the latent variable model is completed 
by specifying the distribution p(u), the mapping 
y(x;w), and the marginal distribution p(x). The 
type of mapping y(x;w) determines the specific 
latent variable model. The desired model for the 
distribution p(t) of the data is then obtained by 
marginalizing over the latent variables: 

$$p(\mathbf{t}) = \int p(\mathbf{t}\mid\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}.$$

Although this integration will, in general, be analytically intractable, there exist specific forms of the distributions p(t|x) and p(x) that lead to an analytic solution. 
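As a worked illustration of such a specific form, consider the choice made in the GTM (Bishop, Svensen, & Williams, 1998): the prior is a sum of delta functions centred on M latent nodes, and the noise is an isotropic Gaussian. The symbols M, x_m, and the inverse noise variance β are notation assumed here for the example.

```latex
% Prior: M equally weighted delta functions on the latent nodes x_m
p(\mathbf{x}) = \frac{1}{M}\sum_{m=1}^{M}\delta(\mathbf{x}-\mathbf{x}_m),
\qquad
p(\mathbf{t}\mid\mathbf{x}) =
\left(\frac{\beta}{2\pi}\right)^{D/2}
\exp\!\left\{-\frac{\beta}{2}\,\lVert \mathbf{y}(\mathbf{x};\mathbf{w})-\mathbf{t}\rVert^{2}\right\}

% The delta functions collapse the integral into a finite sum
p(\mathbf{t}) = \int p(\mathbf{t}\mid\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}
             = \frac{1}{M}\sum_{m=1}^{M} p(\mathbf{t}\mid\mathbf{x}_m)
```

With this choice the data density is a constrained mixture of M Gaussians whose centres are the images y(x_m; w) of the latent nodes.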

Probabilistic Principal Surfaces 

Probabilistic principal surfaces or PPS (Chang, 2000; Chang & Ghosh, 2001) is a nonlinear extension of principal components in that each node on the PPS is the average of all data points that project near/onto it. From a theoretical point of view, the PPS may be seen as a generalization of the generative topographic mapping (GTM) (Bishop, Svensen, & Williams, 1998), which, in turn, can be seen as a parametric alternative to self-organizing maps (SOM) (Kohonen, 1995). 

Some advantages of PPS include the parametric 
and flexible formulation for any geometry/topol- 
ogy in any dimension, and the guaranteed conver- 
gence (indeed the PPS training is accomplished 
through the expectation-maximization (EM) 
algorithm (Dempster, Laird, & Rubin, 1977)). 

It should also be pointed out that a PPS is governed by its latent topology and, owing to its intrinsic flexibility, a large variety of PPS topologies can be created. Among these, that of a 3-D sphere is particularly appealing since a sphere is finite and unbounded, and all nodes are distributed at the edge of the sphere, thus making it ideal for emulating the sparseness and peripheral property of high-D data. The PPS generalizes the GTM model by building a unified model, and shares the same formulation as the GTM, except for an oriented covariance structure for nodes in R^D. This means that data points projecting near a principal surface node have a higher influence on that node than points projecting far away from it (Figure 1). Finally, the sphere topology (which, unlike the SOM grid, has no edges) can be easily comprehended by humans, and can thereby be extremely effective for the visualization of high-D data. 

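Each node y(x;w), x ∈ {x_m}_{m=1}^{M}, has covariance 

$$\Sigma(\mathbf{x}) = \frac{\alpha}{\beta}\sum_{q=1}^{Q}\mathbf{e}_q(\mathbf{x})\,\mathbf{e}_q^{T}(\mathbf{x}) + \frac{D-\alpha Q}{\beta\,(D-Q)}\sum_{d=Q+1}^{D}\mathbf{e}_d(\mathbf{x})\,\mathbf{e}_d^{T}(\mathbf{x}), \qquad 0 < \alpha < \frac{D}{Q},$$

where 

• {e_q(x)}_{q=1}^{Q} is the set of orthonormal vectors tangential to the manifold at y(x;w). 

• {e_d(x)}_{d=Q+1}^{D} is the set of orthonormal vectors orthogonal to the manifold at y(x;w). 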
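The oriented covariance can be sketched numerically as follows. The sketch assumes the Chang and Ghosh (2001) form, in which the tangential directions are weighted by α/β and the orthogonal directions by (D − αQ)/(β(D − Q)); the function name `oriented_covariance` and the representation of the tangent basis as a D × Q matrix with orthonormal columns are illustrative choices, not part of the original software.

```python
import numpy as np


def oriented_covariance(tangent, alpha, beta):
    """Oriented covariance of one PPS node (illustrative sketch).

    tangent : (D, Q) array whose orthonormal columns span the tangent space.
    alpha   : clamping factor, must satisfy 0 < alpha < D / Q.
    beta    : inverse noise variance.
    """
    D, Q = tangent.shape
    if not 0.0 < alpha < D / Q:
        raise ValueError("alpha must satisfy 0 < alpha < D/Q")
    p_par = tangent @ tangent.T      # sum of e_q e_q^T over tangent directions
    p_perp = np.eye(D) - p_par       # sum of e_d e_d^T over normal directions
    return (alpha / beta) * p_par + (D - alpha * Q) / (beta * (D - Q)) * p_perp
```

Two properties are easy to check with this sketch: for α = 1 both coefficients equal 1/β, so the covariance is spherical, and for any admissible α the trace equals D/β, so α only redistributes the variance between tangential and orthogonal directions.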


Figure 1. Under a spherical Gaussian model of the GTM, points 1 and 2 have equal influences on the 
centre nodey(x) (a) PPS have an oriented covariance matrix so point 1 is probabilistically closer to the 
centre nodey(x) than point 2 (b) (Figure taken from Chang, 2000). 
The complete set of orthonormal vectors {e_d(x)}_{d=1}^{D} spans R^D. The parameter α is a clamping factor and determines the orientation of the covariance matrix. The unified PPS model reduces to GTM for α = 1 and to the manifold-aligned GTM for α > 1: 

$$\Sigma(\mathbf{x}) = \begin{cases} \perp \text{ to the manifold} & 0 < \alpha < 1 \\ I_D \text{ or spherical} & \alpha = 1 \\ \parallel \text{ to the manifold} & 1 < \alpha < D/Q \end{cases}$$

The EM algorithm can be used to estimate the PPS parameters W and β, while the clamping factor is fixed by the user and is assumed to be constant during the EM iterations. If we choose a 3-D latent space, a spherical manifold can be constructed using a PPS with nodes {x_m}_{m=1}^{M} arranged regularly on the surface of a sphere in R^3 latent space, with the latent basis functions evenly distributed on the sphere at a lower density. After a PPS model is fitted to data, the data themselves are projected for visualization purposes into the latent space as points on a sphere (Figure 2). 

The latent manifold coordinates of each data point t_n are computed as 

$$\hat{\mathbf{x}}_n \equiv \langle \mathbf{x}\mid\mathbf{t}_n\rangle = \int \mathbf{x}\,p(\mathbf{x}\mid\mathbf{t}_n)\,d\mathbf{x} = \sum_{m=1}^{M} r_{mn}\,\mathbf{x}_m,$$

where r_mn are the latent variable responsibilities, defined as 

$$r_{mn} = p(\mathbf{x}_m\mid\mathbf{t}_n) = \frac{p(\mathbf{t}_n\mid\mathbf{x}_m)\,P(\mathbf{x}_m)}{\sum_{m'=1}^{M} p(\mathbf{t}_n\mid\mathbf{x}_{m'})\,P(\mathbf{x}_{m'})} = \frac{p(\mathbf{t}_n\mid\mathbf{x}_m)}{\sum_{m'=1}^{M} p(\mathbf{t}_n\mid\mathbf{x}_{m'})}.$$

Since ||x_m|| = 1 and Σ_m r_mn = 1 for n = 1, ..., N, these coordinates lie within a unit sphere, that is, ||x̂_n|| ≤ 1. 
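The projection of data points into the latent sphere can be illustrated with a short sketch. It makes a simplifying assumption: the noise model is taken to be isotropic (the GTM case, α = 1) with equal priors over the nodes, so that each responsibility is a normalized Gaussian likelihood. The function names `responsibilities` and `project_to_latent` are hypothetical.

```python
import numpy as np


def responsibilities(T, Y, beta):
    """Responsibilities r_mn = p(x_m | t_n) under isotropic Gaussian noise.

    T    : (N, D) data points.
    Y    : (M, D) images y(x_m; W) of the latent nodes in data space.
    beta : inverse noise variance.
    Returns an (M, N) array whose columns each sum to one.
    """
    sq_dist = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)  # (M, N)
    log_p = -0.5 * beta * sq_dist
    log_p -= log_p.max(axis=0, keepdims=True)   # stabilize the exponentials
    R = np.exp(log_p)
    return R / R.sum(axis=0, keepdims=True)


def project_to_latent(R, X):
    """Latent coordinates: responsibility-weighted mean of the latent nodes.

    R : (M, N) responsibilities, X : (M, 3) unit-norm latent nodes.
    Returns an (N, 3) array of points inside the unit sphere.
    """
    return R.T @ X
```

Because each projected point is a convex combination of unit-norm nodes, its norm cannot exceed one, which is the property stated for the latent coordinates.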


SPHERICAL PPS AS DATA 
VISUALIZATION TOOLS 

From the visualization point of view, the software implemented within Astroneural allows the user to: 

a. Interact with the data in the latent space in several ways. 

b. Visualize the data probability density function in the latent space in order to gain a first understanding of the clusters existing in the data. 

c. Select a number of clusters and visualize the individual data points therein. 


Figure 2. (a) The spherical manifold in R 3 latent space; (b) The spherical manifold in R 3 data space; (c) 
Projection of data points t onto the latent spherical manifold (Figure taken from Chang, 2000). 
If needed, one can still interact with the data by selecting data points in a given cluster and carrying out a number of comparisons and tests. 

Interactively Selecting Points on the 
Sphere 

Having projected the data on the latent sphere, a 
typical task performed by most data analyzers is 
the localization of the most interesting data points; 
for instance, the ones lying far away from denser 
areas (outliers), or those lying in the overlapping 
regions between clusters, and to investigate their 
characteristics by linking the data points on the 
sphere with their position in the original data 
set. For instance, in the astronomical application 
described later on, if the images corresponding 
to the data were available, the user might want 
to visualize the object on the astronomical image 
corresponding to the data point selected on the 
sphere. The user is also allowed to select a latent 
variable and color all the points for which that 
specific latent variable is responsible (Figure 3). 


Visualizing the Latent Variable 
Responsibilities on the Sphere 

The simple projections of the data points onto the sphere provide only partial information about the clusters inherently present in the data. For instance, if the points are strongly overlapping, the user cannot derive any information at all. A first insight into the number of agglomerates located on the spherical latent manifold is provided by the mean of the responsibility for each latent variable. Furthermore, if we build a spherical manifold that is composed of a set of faces, each one delimited by four vertices, then we can color each face with colors varying in intensity on the basis of the value of the responsibility associated with each vertex (and hence, with each latent variable). The overall result is that the sphere will contain regions denser than other regions, and this information is easily visible and understandable. Obviously, denser areas of the spherical manifold might contain more than one cluster, and this calls for further investigation. 


Figure 3. Data points selection phase. The bold black circles represent the latent variables; the blue 
points represent the projected input data points. When a latent variable is selected, each projected point 
for which the variable is responsible is colored. 
A Method to Visualize Clusters on the Sphere 

Once the user has an overall idea of the number of clusters on the sphere, this information can be exploited through the use of classical clustering techniques (such as hard or fuzzy k-means (Bezdek, Keller, Krisnapuram, & Pal, 1999)) to find the prototypes of the clusters and the data contained therein. This task is accomplished by running the clustering algorithm on the projected data. Afterwards, one may proceed by coloring each cluster with a given color (see Figure 4). 
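As a rough illustration of this step, the following sketch runs hard k-means (Lloyd's algorithm) on points that have already been projected into the latent space. It is a generic textbook version written for this example, not the clustering code of the Astroneural tool; the function name `kmeans`, the random initialization, and the default parameters are assumptions.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Hard k-means on projected data (illustrative sketch).

    points : (N, 3) latent coordinates of the projected data.
    Returns (centers, labels); labels[n] is the cluster index of point n.
    """
    rng = np.random.default_rng(seed)
    # Initialize the prototypes with k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Move each prototype to the mean of its points (keep it if empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The returned labels can then be mapped to colors, one per cluster, when drawing the points on the sphere.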

The visualization options described so far have been integrated in a user-friendly graphical user interface that provides a unified tool for training the PPS model and, once training is complete, for carrying out the visualization, characterization, and further analysis of the data (Staiano, 2004). 

Figure 4. Clusters computed in the latent space by k-means 

PPS AS A CLASSIFICATION TECHNIQUE 

The spherical PPS may also be used as a “reference manifold” for classifying high-D data. A reference spherical manifold is computed for each class during the training phase. In the test phase, a data point previously unseen by the model is assigned to the class of its nearest spherical manifold. Obviously, the concept of “nearest” implies a distance computation between a data point t and the nodes of the manifold. Before doing this computation, the data point t must be linearly projected onto the manifold. Since a spherical manifold consists of square and triangular patches, each one defined by three or four manifold nodes, only an approximation of the distance is computed. The PPS framework provides three approximation methods: 

• Nearest Neighbor: Finds the minimal 
square distance to all manifold nodes. 

• Grid Projections: Finds the shortest projec- 
tion distance to a manifold grid. 

• Nearest Triangulation: Finds the nearest 
projection distance to the possible triangula- 
tion. 
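The first of these approximations, nearest neighbor, is simple enough to sketch directly. The code below assumes that each class's reference manifold is available as an array of node positions in data space; the function name `nearest_manifold_class` and the dictionary layout are hypothetical, and the grid-projection and triangulation variants are not covered.

```python
import numpy as np


def nearest_manifold_class(t, manifolds):
    """Assign t to the class whose manifold has the closest node.

    t         : (D,) data point.
    manifolds : dict mapping class label -> (M, D) array of manifold nodes.
    Returns (label, squared distance to the nearest node of that class).
    """
    best_label, best_dist = None, np.inf
    for label, nodes in manifolds.items():
        dist = ((nodes - t) ** 2).sum(axis=1).min()  # minimal squared distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Hypothetical two-class example with two nodes per manifold.
manifolds = {
    "star": np.array([[0.0, 0.0], [1.0, 0.0]]),
    "galaxy": np.array([[5.0, 5.0], [6.0, 5.0]]),
}
label, dist = nearest_manifold_class(np.array([0.9, 0.2]), manifolds)
```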

Another way to use PPS as a classifier consists in choosing the class with the maximum posterior class probability for a given new input t. Formally, suppose we have N labeled data points {t_1, ..., t_N}, with t_n ∈ R^D and class labels in the set {1, ..., C}; then the posterior probabilities may be derived from the class-conditional density p(t | class) via Bayes' theorem: 

$$P(\text{class}\mid\mathbf{t}) = \frac{p(\mathbf{t}\mid\text{class})\,P(\text{class})}{p(\mathbf{t})}.$$

In order to approximate the posterior probabilities P(class | t), we estimate p(t | class) and P(class) from the training data. Finally, an input t is assigned to the class with maximum P(class | t). 
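The decision rule can be illustrated with a small sketch that turns class-conditional log-densities and class priors into posterior probabilities. Working with logarithms is a design choice made here for numerical stability, since densities in high dimension can underflow; the function name `posterior` is hypothetical, and in practice the log-densities would come from one fitted PPS model per class.

```python
import numpy as np


def posterior(log_density, prior):
    """Posterior class probabilities via Bayes' theorem.

    log_density : (C,) values of ln p(t | class) for one input t.
    prior       : (C,) class priors P(class), summing to one.
    """
    log_joint = np.asarray(log_density, dtype=float) + np.log(prior)
    log_joint -= log_joint.max()          # avoid underflow before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()            # dividing by the sum plays the role of p(t)


# Two classes with densities 0.2 and 0.1 and priors 0.25 and 0.75.
post = posterior(np.log([0.2, 0.1]), np.array([0.25, 0.75]))
predicted = int(np.argmax(post))
```

In this example the second class wins despite its lower density, because its prior is three times larger.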

ENSEMBLE OF PPS 

Since PPS builds a probability density function as a mixture of Gaussian distributions trained through the EM algorithm, its performance may degrade with increasing data dimensionality due to singularities and local maxima in the log-likelihood function. Therefore, we propose two schemes for designing a committee of spherical PPS to obtain improved probability density functions and hence, classification rates. The area of ensembles of learning machines is now a well-defined field, and ensembles have been successfully applied to neural networks, especially in the case of supervised learning algorithms. Fewer cases can be found for unsupervised learning methodologies and for density estimation; among these are the works of Ormoneit and Tresp (1998) and Smyth and Wolpert (1999). Both apply to density estimation techniques that are well established in supervised contexts, namely stacking (Wolpert, 1992) and bagging (Breiman, 1996), and they represent the basis of our proposed schemes. 

Stacking probabilistic principal surfaces for density estimation: StPPS. The ensemble described here may be seen as an instantiation of the method proposed in Smyth and Wolpert (1999). Suppose we are given S PPS models (i.e., S density estimators) {PPS_s(t)}_{s=1,...,S}, where PPS_s(t) is the s-th PPS model. Note that in the original formulation given in Smyth and Wolpert (1999), the S density estimators could also be of different types, for example, finite mixtures with a fixed number of component densities, or kernel density estimates with a fixed kernel and a single fixed global bandwidth in each dimension. Going back to our model, the S PPS models can be made diverse enough by considering different clamping factors, numbers of latent variables, and latent bases. In order to stack the S PPS models, we follow the procedure described next: 

1. Let D be the training data set, with size |D| = N. Partition D v times, as in v-fold cross validation. Each fold contains exactly (v-1) × N/v training data points and N/v test data points, both from the training set D. 

For each fold: 

a. fit each of the S PPS models to the training subset of D. 

b. evaluate the likelihood of each data point in the test partition of D, for each of the S fitted models. 

2. At the end of these preliminary steps, we obtain S density estimates for each of the N data points, organized in a matrix A of size N × S, where each entry a_is is PPS_s(t_i). 

3. Use the matrix A to estimate the combination coefficients {π_s}_{s=1,...,S} that maximize the log-likelihood at the points t_i of a stacked density model of the form 

$$\text{StPPS}(\mathbf{t}) = \sum_{s=1}^{S}\pi_s\,\text{PPS}_s(\mathbf{t}),$$

which corresponds to maximizing 

$$\sum_{i=1}^{N}\ln\left(\sum_{s=1}^{S}\pi_s\,\text{PPS}_s(\mathbf{t}_i)\right)$$

as a function of the weight vector (π_1, ..., π_S). Direct maximization of this function is a nonlinear optimization problem. We can apply the EM algorithm directly by observing that the stacked mixture is a finite mixture density with weights (π_1, ..., π_S). Thus, we can use the standard EM algorithm for mixtures, except that the parameters of the component densities PPS_s(t) are fixed, and the only parameters allowed to vary are the mixture weights. 

4. The concluding phase consists in re-estimating the parameters of each of the S component PPS models using all of the training data D. The stacked density model is then the linear combination of the component PPS models so obtained, with combining coefficients {π_s}_{s=1,...,S}. 
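Step 3 of the procedure, the EM update restricted to the mixture weights, can be sketched as follows. The sketch assumes the matrix of out-of-fold densities has already been computed and contains strictly positive entries; the function name `stack_weights`, the uniform initialization, and the stopping rule are illustrative choices rather than details taken from the original software.

```python
import numpy as np


def stack_weights(A, n_iter=200, tol=1e-10):
    """EM for the stacking coefficients with fixed component densities.

    A : (N, S) array, A[i, s] = density of model s at held-out point t_i.
    Returns the weight vector pi of length S (non-negative, sums to one).
    """
    N, S = A.shape
    pi = np.full(S, 1.0 / S)
    prev = -np.inf
    for _ in range(n_iter):
        mix = A @ pi                       # stacked density at each point
        loglik = np.log(mix).sum()
        if loglik - prev < tol:            # log-likelihood has stopped improving
            break
        prev = loglik
        resp = A * pi / mix[:, None]       # E-step: responsibility of model s for t_i
        pi = resp.mean(axis=0)             # M-step: only the weights are updated
    return pi
```

Since the component densities never change, each iteration costs only one matrix-vector product, which is why stacking adds little overhead once the cross-validated densities are available.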

Committee of PPS via Bagging: BgPPS. The second ensemble proposed employs bagging as a means to average single density estimators, in our case the PPS, in a way similar to the model proposed in Ormoneit and Tresp (1998). All we have to do is train a number S of PPS models on S bootstrap replicates of the original learning data set. At the end of this training process, we obtain S different density estimates that are then averaged to form the overall density estimate. Formally, let D be the original training set of size N and {PPS_s}_{s=1,...,S} a set of PPS models: 

1. Create S bootstrap replicates (with replacement) of D, {D_Boot(s)}_{s=1,...,S}, each of size N. 

2. Train each of the S PPS models on its bootstrap replicate D_Boot(s). 

3. At the end of the training we obtain S density estimates {PPS_s}_{s=1,...,S}. 

4. Average the S density estimates {PPS_s}_{s=1,...,S} as 

$$\text{BgPPS}(\mathbf{t}) = \frac{1}{S}\sum_{s=1}^{S}\text{PPS}_s(\mathbf{t}).$$
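The four steps can be sketched generically. To keep the example self-contained, a single one-dimensional Gaussian fit stands in for the PPS density estimator; this substitution, together with the function names `fit_gaussian` and `bagged_density`, is an assumption made purely for illustration.

```python
import numpy as np


def fit_gaussian(sample):
    """Stand-in density estimator: a single 1-D Gaussian (not a PPS)."""
    mu, sigma = sample.mean(), sample.std() + 1e-9
    return lambda t: np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def bagged_density(data, fit, S, seed=0):
    """Train S estimators on bootstrap replicates and average their densities.

    data : (N,) training sample.
    fit  : function mapping a sample to a density function.
    """
    rng = np.random.default_rng(seed)
    N = len(data)
    # Steps 1-3: one bootstrap replicate (sampling with replacement) per model.
    models = [fit(data[rng.integers(0, N, size=N)]) for _ in range(S)]
    # Step 4: the committee density is the plain average of the S estimates.
    return lambda t: np.mean([m(t) for m in models], axis=0)
```

Because the result is an equally weighted average of normalized densities, it is itself a normalized density, which is what allows it to be used directly in the Bayes decision rule.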

APPLICATION OF PPS TO GOODS 
DATA 

The Great Observatories Origins Deep Survey (GOODS) is an international project that joins together NASA, ESA (European Space Agency), and some of the most powerful ground-based facilities to survey the distant universe to the faintest flux limits across the broadest range of wavelengths. At the end of the project, GOODS will survey a total of roughly 320 square arc minutes in two fields centered on the Hubble Deep Field North and the Chandra Deep Field South, respectively (Dickinson, 2002). The currently available GOODS catalogue is composed of 28,405 objects. Each object has been measured in seven optical bands, namely the U, B, V, R, I, J, and K bands. For each band, three different kinds of parameters, astrometric (positions), geometric (i.e., Kron radius, ellipticity, etc.), and photometric (fluxes and magnitudes), were measured, adding up to several dozen parameters. Objects are also classified as angularly resolved (or galaxies, in the astronomical jargon) and nonresolved (or stars). 
Moreover, GOODS data (and, more generally, astronomical survey data) present a further peculiarity: the majority of the objects are “drop outs,” that is, they are detected only in some bands and not in the others, due to either instrumental (different detection limits) or intrinsic (different spectral properties) reasons. Without entering into details, we must stress that the characterization of an object as a “drop out” (i.e., as an object with a strong relative flux difference between two or more spectral regions) is very important from the astronomical point of view, since it allows discrimination among different classes of celestial objects. From our statistical point of view, therefore, the data set contains four classes of objects, namely stars, galaxies, stars that are drop outs, and galaxies that are drop outs (at this stage, we do not take into account the number of bands for which an object is a drop out). In order to evaluate the performance of our tools, we also processed a reference synthetic catalogue, kindly provided by Maurilio Pannella (MPI-Garching, Germany), matching the characteristics of the GOODS data. This catalogue contains 20,000 objects equally divided into two classes formed by 10,000 stars and 10,000 galaxies, respectively. Each object is described by eight features (parameters), namely the magnitudes in the corresponding eight optical filters. 

Data Visualization for GOODS 
Catalogue 

In Figure 5, we visualize the actual GOODS catalogue. As can be seen, it exhibits four strongly overlapping classes. The PCA visualization gives no interesting information at all, since it displays only a single condensed group of data. In PCA, the class of dropped galaxies (whose objects are colored yellow), which contains the majority of objects (about 24,000), is almost totally hidden. The PPS projection onto the spherical latent manifold is much more readable than the PCA projection and contains much more information. The figure also depicts the corresponding latent variable probability density function. By rotating the sphere with density, two high-density regions are highlighted together with a few other regions of lower density. 

Figure 5. From top left to bottom right clockwise: GOODS 3-D PCA projections, PPS projections on the sphere, galaxy density on the sphere, and star density on the sphere. 

Classification of GOODS Data 

In the case of StPPS, we built a model in which a group of six different PPS models, each one with a fixed α value, are put together in an ensemble via stacking. An important parameter for stacking is the number v of folds in the cross validation procedure. In our experiments, we tried 5-fold and 10-fold cross validation. 

In the case of BgPPS we used, instead, a single PPS model with its own parameter settings, and bagged it in order to improve its performance. In our experiments, we bag 10 PPS models (for α = 0.2, 0.4, ..., 2.0) in order to assess the best α value. The PPS models are trained on 20 bootstrap replicates of the training data set (hence, we have a committee of 20 PPS models whose responses are averaged). 


In all experiments, the classifiers are run 25 times, and each time, new training and test data partitions (60% for training and 40% for testing) are generated. Moreover, for comparison purposes, we also carried out classification using single PPS models, to: 

1. Compute the reference manifolds for 
each class (we denote this classifier as 
PPSRM). 

2. Compute the posterior class probability 
(hereinafter denoted as PPSPR). 

Application to Synthetic Catalogue: 
StPPS 

Parameter settings are listed in Table 1. The results are depicted in Figure 6, and clearly show that 5-fold cross validation works better than 10-fold cross validation in both the mean classification error (1.34 against 1.84, respectively) and standard deviation (0.2606 against 0.4071, respectively). 

The minimum error reached is 1.05, as shown in Table 2, which also gives the corresponding confusion matrix. The difference between 5-fold and 10-fold cross validation could be explained by the fact that the training set is quite large, so a 10-fold cross validation may lead to overfitting problems (recall that in our PPS models we do not employ any regularization method). 


Table 1. Synthetic catalogue: StPPS parameter settings 

Parameters  PPS 1  PPS 2  PPS 3  PPS 4  PPS 5  PPS 6
α           1      0.5    3      0.2    0.3    0.8
M           266    266    266    266    266    266
L           18     51     51     51     6      51
Lfac        2.2    2      2      2      2.5    2
iter        100    100    100    100    100    100
ε           0.01   0.01   0.01   0.01   0.01   0.01

Figure 6. Synthetic catalogue: StPPS classification errors over 25 iterations 


Table 2. Synthetic catalogue: Confusion matrix by StPPS best result 

Classifier     Confusion Matrix
StPPS (1.05)            Star   Galaxy
               Star     3943   27
               Galaxy   57     3973
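The error figure quoted next to the classifier name can be recovered from the confusion matrix itself: it is the percentage of objects that fall off the diagonal. The short sketch below, with the hypothetical helper `error_rate_percent`, makes the arithmetic explicit for the matrix above.

```python
import numpy as np


def error_rate_percent(confusion):
    """Classification error (in percent) from a square confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    return 100.0 * (1.0 - np.trace(confusion) / confusion.sum())


# Confusion matrix of the best StPPS run on the synthetic catalogue.
stpps_synthetic = [[3943, 27],
                   [57, 3973]]
err = error_rate_percent(stpps_synthetic)   # 84 misclassified out of 8000 objects
```

The 8,000 objects correspond to the 40% test share of the 20,000-object synthetic catalogue, and 84 errors out of 8,000 give the reported 1.05%. The result does not depend on whether rows or columns hold the true classes.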


Table 3. Synthetic catalogue: BgPPS parameter settings 

Parameter  Value  Description
M          266    number of latent variables
L          60     number of basis functions
Lfac       1      basis functions width
iter       100    maximum number of iterations
ε          0.01   early stopping threshold


Application to Synthetic Catalogue: 
BgPPS 

The parameter settings are shown in Table 3. On the synthetic catalogue, bagging performs very well for values of the clamping factor α in the interval [1.0, 2.0], where the best mean classification error and standard deviation results are obtained. In particular, for α = 2.0, BgPPS reaches its minimum mean classification error (0.24) (see Figure 7). 

Application to GOODS Catalogue: 
StPPS 

In GOODS catalogue, the behavior of the stacked 
model, for which the parameters are set as in 
Figure 7. Synthetic catalogue: BgPPS error bars over 25 iterations for each fixed α 

Table 4. Synthetic catalogue: Confusion matrix by BgPPS best model 

Classifier     Confusion Matrix           α
BgPPS (0.05)            Star   Galaxy     2.0
               Star     3996   0
               Galaxy   4      4000

Table 5, is inverted in terms of 5-fold and 10-fold cross validation. 

In fact, here we have better results for 10-fold cross validation (mean classification error 2.87 and standard deviation 0.1344) than for 5-fold cross validation (mean classification error 3.44 and standard deviation 0.4720), as can be seen from Figure 8. This is reasonable, as the number of training data for the first three classes (S, G, and SD) is much smaller than the number of training data for class GD, so a higher number 

Table 5. GOODS catalogue: StPPS parameter settings 

Parameters  PPS 1  PPS 2  PPS 3  PPS 4  PPS 5  PPS 6
α           1.4    1.2    0.8    0.6    1.6    2.0
M           266    266    266    266    615    615
L           18     83     83     83     83     83
Lfac        1      2      1.5    1.1    1.3    2
iter        100    100    100    100    100    100
ε           0.01   0.01   0.01   0.01   0.01   0.01
of folds leads to a better fit to the data. The confusion matrix corresponding to the minimum error (2.62) is shown in Table 6. 

Application to GOODS catalogue: 
BgPPS 

For the GOODS catalogue, the results fluctuate more across the α values. In fact, the best results are obtained in the intervals [0.2, 0.6] and [1.4, 2.0], as can be seen from Figure 9. The overall best result falls in the second interval, in particular for α = 1.8 (mean classification error 2.74 and standard deviation 0.3987), even though BgPPS with α = 0.6 obtains a lower standard deviation value (0.1725). The minimum classification error with confusion matrix is shown in Table 7. 


Figure 8. GOODS catalogue: StPPS classification errors over 25 iterations 


Table 6. GOODS catalogue: Confusion matrix by StPPS best model 

Classifier     Confusion Matrix
StPPS (2.62)         S     G      SD    GD
               S     92    4      2     0
               G     76    1234   2     36
               SD    0     0      52    36
               GD    0     8      134   9688
Figure 9. GOODS catalogue: BgPPS error bars over 25 iterations for each fixed α 

Table 7. GOODS catalogue: Confusion matrix by BgPPS best model 

Classifier     Confusion Matrix                 α
BgPPS (2.15)         S     G      SD    GD      1.8
               S     155   35     12    5
               G     8     1160   6     8
               SD    0     0      64    7
               GD    5     51     108   9740


Synthetic Catalogue: PPSRM, PPSPR, StPPS, and BgPPS Comparison 

As can be seen from Figure 10, BgPPS outperforms both PPSRM and PPSPR for nearly all α values. Moreover, from Figure 11, it is clear that BgPPS outperforms StPPS. The latter model, on average, performs better than the single-model classifiers, even though PPSPR reaches a better result for just one α value. 

GOODS Catalogue: PPSRM, PPSPR, 
StPPS, and BgPPS Comparison 

The GOODS catalogue classification task is more
complex. The four classes are heavily overlapping 
and even in the best cases, there are classes (i.e., 
Figure 10. Synthetic catalogue: PPSRM, PPSPR, and BgPPS mean classification errors over 25 iterations
for each fixed α


Figure 11. Synthetic catalogue: PPSRM, PPSPR, StPPS, and BgPPS best model statistics 
Figure 12. GOODS catalogue: PPSRM, PPSPR, and BgPPS mean classification errors over 25 iterations
for each fixed α


Figure 13. GOODS catalogue: PPSRM, PPSPR, StPPS, and BgPPS best model statistics 
S and SD) whose objects are classified with an
error rate of about 60%. This is evident from the
results obtained by the different classifiers used.
However, even in this case, BgPPS outperforms 
all the other models (PPSRM, PPSPR, and StPPS). 
Moreover, StPPS here outperforms both PPSRM 
and PPSPR. Among the two single PPS classifier 
models, PPSPR is still better than PPSRM (see 
Figures 12 and 13). 


APPLICATION OF PPS TO YEAST 
GENES MICROARRAY DATA 

Gene-expression microarrays, whose development 
started in the second half of the 1990s, are having 
a powerful impact on molecular biology. In fact, 
although the ability to measure transcription of a 
single gene is not new, the possibility to measure 
the transcription of all genes, in an organism, at 
once, is a recent advance, and is leading to new 
methods of diagnosis and of treatment for a large 
number of diseases. However, it is also becom- 
ing increasingly clear that simply generating the 
data is not enough, and that the extraction of the 
relevant information is a nontrivial task. Statistical
techniques and other classical methods of data
analysis are not adequate and therefore, in the last
decade, much work has focused on the development
of machine-learning methodologies suited
for the analysis of genetic data. Just to mention 
a few, support vector machines have been used 
for the functional classification of genes (Brown, 
Grundy, Lin, Cristianini, Sugnet, Furey, Ares, 
& Haussler, 2000); clustering techniques were 
used for grouping similar expression patterns 
across a number of experiments of all the genes 
of the yeast Saccharomyces cerevisiae (Spellman, 
Sherlock, Zhang, Iyer, Anders, Eisen, Brown, 
Botstein, & Futcher, 1998); neural networks have
been employed both for clustering and visualization
of gene microarray data (Tamayo, Slonim,
Mesirov, Zhu, Kitareewan, Dmitrovsky, Lander,
& Golub, 1999; Toronen, Kolehmainen, Wong, &
Castren, 1999).

In order to investigate the capabilities of PPS 
in this different field of activity, we started from 
the work of Spellman and his colleagues de- 
scribed in Spellman et al. (1998), which provides 
a comprehensive catalogue of yeast genes whose 
transcript levels vary periodically within the cell 
cycle. In order to produce the catalogue, samples 
from yeast cultures synchronized with different 
experiments were used. 

In Spellman et al. (1998) a type of agglom- 
erative hierarchical clustering (Eisen, Spellman, 
Brown, & Botstein, 1998) was used in order to 
identify clusters of genes behaving similarly in 
each experiment, and which represent groups 
of apparently coregulated genes. These clusters 
provide a solid basis for understanding the tran- 
scriptional mechanism of cell cycle regulation. 
The data set we used consists of 6,125
genes, subject to four different experiments. Each 
experiment consists of measurements at different 
epochs, for a total of 73 parameters. 

Preprocessing 

In order to make this data set more suitable for
processing with PPS, we first applied a preprocessing
phase in which, through the use of a nonlinear
PCA (Tagliaferri, Ciaramella, Milano, Barone,
& Longo, 1999), we reduced each experiment to
eight measurements, and eliminated the genes
whose experiments had too much missing data.
Hence, the resulting data set consists of 5,425 genes
and 32 features (four experiments with eight
measurements each). Furthermore, since microarray
data is generally noisy, it is necessary to resort
to some kind of cleaning procedure to identify
those genes affected by the noise processes involved
in the generation of data from microarrays.

To this end, we decided to train a PPS with a
high number of latent variables, so that each one
is responsible for a limited number of data points;
afterward, we applied a clustering procedure on the
Figure 14. Yeast gene data set: (a) 3-D PCA projection; (b) Data point projections in the latent space;
(c) Data probability density in the latent space



Figure 15. Cluster prototype periodic behaviors and error bars (3σ) showing the standard deviations of
genes from the prototypes for a fixed cluster. On top of each subplot, the cluster number and the number 
of genes within each cluster is reported. 
nodes of the manifold in the data space. In doing so,
a number of identified clusters containing genes
with low variance (i.e., genes whose transcript
levels show poor periodic behavior) were
discarded. The number of remaining genes turned out
to be 2,761.

PPS and Yeast Gene Data: Results 

We used a PPS with 266 latent variables, 40
latent basis functions, and a clamping factor α
set to 0.5. After the completion of the training
phase, we projected the data into the latent space 
and computed the responsibility for each latent 
variable, as shown in Figure 14. 

On the basis of the probability density function
visualized in Figure 14, we decided to identify 30 


Figure 16. PPS and Spellman cluster comparisons.
The rows report the 30 PPS clusters,
while the columns report the clusters computed by
Spellman. The (i, j)-th entry of the table corresponds
to the fraction of Spellman cluster j falling in the
PPS cluster i.
clusters through a hierarchical clustering procedure.
For each cluster, we plotted the prototype
trend with respect to the 32 features, as can be
seen in Figure 15, which highlights the periodic
behavior of the genes belonging to each cluster.

In Figure 16, we compare the results of our 
clustering procedure with those obtained by 
Spellman et al. (1998). Before discussing the 
table, we wish to stress that the two clustering 
procedures were completely different: Spellman,
in fact, clustered the genes using a priori
knowledge of their characteristics and thus
worked with only 209 genes, while our algorithm
made use only of the statistical properties
of the data with no a priori knowledge. In spite of
this, some remarkable patterns may be detected:
Spellman’s cluster number 1 falls almost entirely
in our cluster 8; Spellman’s clusters number 2
and 8 are, statistically speaking, indistinguish- 
able (together they form our cluster number 28); 
Spellman’s cluster number 5 appears to be a sort 
of statistical waste basket that groups together 
rather different clusters (7, 8, 17, 20, plus several 
others with lower significance) that, however, are 
topological neighbors in the PPS latent space, and 
can therefore be considered as “substructures” 
(missed by Spellman) of a larger cluster. Finally, 
cluster 21 contains, entirely, the genes belonging 
to Spellman’s cluster 3. The most relevant result, 
however, seems to be the fact that many (13 out of 
30) of our clusters are not mapped by any of the 
209 genes in the Spellman sample. Whether these 
clusters have or have not biological significance 
will be the subject of future studies (see Amato,
Ciaramella, Deniskina, Del Mondo, di Bernardo,
Donalek, Longo, Mangano, Miele, Raiconi,
Staiano, & Tagliaferri, 2006).
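
The comparison table of Figure 16 can be illustrated with a minimal sketch. It assumes that each clustering is available as a mapping from a gene identifier to a cluster label; the function name, the gene names, and the toy labels below are hypothetical and are not taken from the study.

```python
from collections import Counter

def overlap_matrix(pps_labels, ref_labels):
    """Return {(i, j): fraction of reference cluster j that falls in PPS cluster i}.

    Both arguments map a gene identifier to a cluster label; only genes
    present in both mappings are compared.
    """
    shared = pps_labels.keys() & ref_labels.keys()
    ref_sizes = Counter(ref_labels[g] for g in shared)
    joint = Counter((pps_labels[g], ref_labels[g]) for g in shared)
    return {(i, j): n / ref_sizes[j] for (i, j), n in joint.items()}

# Toy data (hypothetical gene names and labels, for illustration only).
pps = {"g1": 8, "g2": 8, "g3": 21, "g4": 28, "g5": 28, "g6": 7}
ref = {"g1": 1, "g2": 1, "g3": 3, "g4": 2, "g5": 8, "g6": 1}
table = overlap_matrix(pps, ref)
```

Each column of the resulting table sums to 1, so an entry close to 1 means that a reference cluster falls almost entirely inside a single PPS cluster.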

CONCLUSION 

The aim of this work is to propose a new tool to
the data mining community. The tool is based
upon probabilistic principal surfaces, and we
discussed its potential and highlighted the flexibility
it exhibits in a number of activities fruitful
for data mining applications. For this purpose,
the discussion focused, in particular, on PPS
classification and data visualization capabilities,
two important activities for data mining, by showing
some results obtained with data coming from
astronomy (star/galaxy data) and genetics (yeast
gene microarray data). Although the experimental
results obtained are, for one case (i.e., yeast
gene data), in a preliminary phase and must be
validated by further experiments, the results
themselves clearly show the advantages and
benefits that PPS provide.

Classification Tasks 

Even though the basic PPS model provides good
classification performance (Chang, 2000), we
showed that this performance can be further improved
by employing PPS in an ensemble, proposing,
specifically, two combining schemes based on
stacked generalization and bagging. We applied
the PPS ensembles to astronomical data, and from
the experimental results, it can be stated that a
committee of PPS performs better than a single PPS,
although this is clear only for the ensemble of PPS
built via bagging. Stacked PPS, instead, has less
stable results, but it still seems a promising combining
scheme, since we ran only a few experiments
varying the PPS component complexity. We
focused instead on the impact of cross validation,
which appears to be of primary importance.

Data Visualization Tasks 

The spherical PPS, which consists of a spherical
latent manifold lying in a three-dimensional latent
space, is better suited to visualizing high-D data,
since the sphere is able to capture the sparsity and
periphery of data in large input spaces, which are
due to the curse of dimensionality. We proposed
a number of visualization possibilities integrated
in a user-friendly graphical user interface:


• Interactive selection of regions of sample 
points projected into the sphere for further 
analysis. This is particularly useful to profile 
groups of data. 

• Visualization of the latent variable responsibilities
on the sphere as a colored surface
plot. It is useful for localizing more and less
dense areas, for finding an initial number of
clusters existing in the data, and for highlighting
the regions where outliers lie.

• A method to exploit the information gath- 
ered with the previous visualization options 
through a clustering algorithm to find out the 
clusters with the corresponding prototypes 
and data points. 

The data visualization tasks have proved
effective for data mining in complex application
domains: astronomical data and yeast gene microarray
data analysis. Although the study of the
methods addressed in this chapter is devoted to the
astronomical and genetic applications, the system
is general enough to be used in any data-rich
field to extract meaningful information.
ABSTRACT

Advances in computing techniques, as well as the reduction in the cost of technology, have made pos- 
sible the viability and spread of large virtual environments. However, efficient navigation within these 
environments remains problematic for novice users. Novice users often report being lost, disorientated, 
and lacking the spatial knowledge to make appropriate decisions concerning navigation tasks. In this 
chapter, we propose the frequent wayfinding-sequence (FWS) methodology to mine the sequences rep- 
resenting the routes taken by experienced users of a virtual environment, in order to derive informative 
navigation models. The models are used to build a navigation assistance interface. We conducted several 
experiments using our methodology in simulated virtual environments. The results indicate that our ap- 
proach is efficient in extracting and formalizing recommended routes of travel from the navigation data 
of previous users of large virtual environments. 

INTRODUCTION

The design and implementation of virtual environments
(VEs) have improved significantly in
recent times due to advances in both hardware
and software technology. Research from various
disciplines, such as computer graphics, human-computer
interaction, urban design, and psychology,
has also contributed to the advancement
of the field of VEs. In general, VEs provide a
computer-synthesized world in which users can
interact with objects, perform various activities,
and navigate the environment as if they were in
the real world. Applications for VEs are found in
various domains including medicine, engineering,
oil exploration, and the military (Burdea & Coiffet,
2003; Grady, 2003; Sherman & Craig, 2002).

Navigation is a fundamental activity in VEs 
(see Figure 1). Therefore, successful use of a 
VE requires that the user be able to easily and 
efficiently navigate from one location to another 
(Darken & Peterson, 2002). However, previous 
research has shown that novice users of VEs are 
often disoriented, feel “lost in hyperspace,” and 
lack the spatial knowledge needed to pick an appropriate
route owing to insufficient experience
with the VE (Conroy, 2001; Darken & Sibert, 1996;
Kantardzic, Rashad, & Sadeghian, 2004; Sadeghian,
Kantardzic, Lozitskiy, & Sheta, 2005; Sadeghian,
Kantardzic, Lozitskiy, & Sheta, 2006a;
Sadeghian, Kantardzic, & Rashad, 2006b; van
Dijk, op Den, Rieks, & Zwiers, 2003). Traditional
tools, such as maps, have demonstrated some suc- 
cess in helping users navigate (Darken & Sibert, 
1996; Statalich, 1995); however, researchers are 
continuously looking for alternative intelligent 
navigation tools (Chen & Stanney, 1999). 

In this chapter, we introduce the frequent
wayfinding-sequence (FWS) methodology
to derive a model of the experienced users’
navigation behaviors. This process is conducted
by transforming the previously recorded navigation
data of experienced users into sequences,
applying a modified sequence mining algorithm
to find frequent sequences corresponding to
preferred routes of travel, and forming a final
model of routing rules. This model is the basis
of a spatial navigation tool that can be used by


Figure 1. Navigation is a fundamental activity in large VEs 
novice users to get recommended routes of travel
within the VE.


RELATED RESEARCH 

One approach to structuring spatial knowledge
about a VE is based on three distinct components: 
landmark knowledge, procedural knowledge, and 
survey knowledge (Darken & Sibert, 1996; Elvins, 
Nadeau, & Kirsh, 2001). Landmark knowledge 
represents information about the shape, size, 
color, and contextual information of landmarks, 
or memorable and distinctive objects in a VE. 
Procedural knowledge is encoded as a series of 
steps required in following a particular route. 
Landmarks play a role in procedural knowledge by 
marking decision points along a route, and helping 
a traveler recall the procedures required to get to 
a destination. Survey knowledge provides a bird’s 
eye view of a region in which object locations and 
interobject distances are encoded in terms of a 
global frame of reference. Landmarks play a role in 
survey knowledge by providing regional anchors 
with which to calibrate distances and directions 
(Chen & Stanney, 1999; Darken & Sibert, 1996; 
Elvins et al., 2001). 

Through experience and repeated exposure 
to a VE, a user gains spatial knowledge and be- 
comes efficient at wayfinding. Wayfinding is “the 
act of traveling to a destination by a continuous,
recursive process of making route-choices whilst
evaluating previous spatial decisions against
constant cognition of the environment” (Conroy,
2001, p. 26). O’Neill (1992) found that wayfinding
performance decreases as an environment’s complexity
increases. On the other hand, an increase
in the familiarity with a VE leads to an increase in 
wayfinding ability (Ruddle, Payne, & Jones, 1998). 
The use of different types of technology to navigate 
(Peterson, Wells, Furness, & Hunt, 1998; Ruddle, 
Randall, & Jones, 1996), the goal associated with 
a wayfinding task (Magliano, Cohen, Allen, & 
Rodrigue, 1995), and the travel techniques used 


by a traveler (Bowman, 1999) are also factors that 
affect wayfinding performance.

Since wayfinding in VEs is not a trivial task,
much effort has been put into developing navigation
tools. Darken and Sibert (1996) found that
the navigation performance of users in a VE 
equipped with a map was superior to the perfor- 
mance of users in a similar VE equipped with a 
radial grid. However, users of a VE without any 
navigation tools performed the worst and had 
extreme difficulty completing navigation tasks. 
Other studies have also shown the benefit of map 
use in improving wayfinding (Statalich, 1995). 
On the other hand, some studies have found that
people are better at finding a target location when
using signs or narrative directions than when using
maps (Moser, 1988; Streeter, Vitello, & Wonsiewicz,
1985). Concentrating on the importance
of landmarks in wayfinding, Elvins, Nadeau, 
and Kirsh (2001) proposed the use of worldlets, 
which are 3-D thumbnails of landmarks that can 
be interactively viewed. Their research showed 
that VEs equipped with worldlets allowed for 
more efficient navigation of the environment, 
compared with VEs equipped with textual or 2-D 
image representation of landmarks. A series of 
recent papers have proposed the use of agents to
assist in wayfinding (Nijholt, Zwiers, & van Dijk,
2001; van Dijk et al., 2003; van Luin, den Akker,
& Nijholt, 2001). In general, agents provide assistance
to users by offering navigation advice and
by answering questions about navigation within a
VE. Other innovative techniques based on the use
of sound or natural language processing have also
been introduced to assist users with navigation
(Gunther, Kazman, & Macgregor, 2004; McNeill,
Sayers, Wilson, & McKevitt, 2002).

The emphasis of the previous research has been 
simply on helping a user of a VE get to a desired 
destination. However, there is often more than 
one possible route available to reach a desired 
destination in large VEs (Kantardzic et al., 2004; 
Sadeghian et al., 2005, 2006a, 2006b). Navigation 
tools are needed that not only help the user reach 
the destination, but also help the user select an
appropriate route when more than one route is 
possible. Our work addresses this concern. 

USING DATA MINING TO IMPROVE 
NAVIGATION TOOL DESIGN 

Previous experience with an environment gen- 
erally leads to an improvement in wayfinding 
ability (Chen & Stanney, 1999) and thus, much 
insight can be gained from the knowledge of 
experienced users (Peterson, Stine, & Darken, 
2000). However, few navigation tools have been 
proposed that log the movements of users of VEs 
and do any analysis on the data (Chen & Stanney, 
1999; Sadeghian et al., 2006b). 

Our objective in designing a navigation tool is 
not to simply get the user of the VE to a desired 
location (i.e., the classical wayfinding problem), 
but to lead the user via the most “preferred” route 
when more than one route is possible. Of course, 
the most “preferred” route is defined differently for 
different environments. For example, in the real 
world the “preferred” route is often the shortest 
or fastest route (Liu, 1996). However, VEs are 
used for a variety of purposes, and other criteria 
may be the driving force in selecting a route. For 
example, the recommended way to get from one 
location to another in a large VE may not be the 
shortest possible route. A longer route may be 
preferred because it is “more scenic,” “has more 
educational value,” or “has the least number of 
obstacles associated with it,” (Sadeghian et al., 
2006b; Smyth & McGinty, 2002). 

The problem is that these “preferred” routes 
are not a priori knowledge, especially to the 
novice user. Even the designers of a VE cannot 
know for sure which route out of several possible 
routes will be preferred by most of the users, or 
all the criteria that users will consider in the route 
selection process. Since these “preferred” routes 
are often not explicitly obvious, a conscious ef- 


fort must be made to find them (Sadeghian et al., 
2005, 2006a). 

In general, data-mining techniques are used to
discover previously unknown patterns, rules, and
relationships (Han & Kamber, 2001; Kantardzic,
2002). Thus, methodologies based on data-mining
techniques can be used to find these “preferred”
routes. By mining the navigation data of previous
experienced users, frequently used routes can be
discovered. A model of these frequent routes,
which are equated with the “preferred” routes of
travel, can be developed to form the basis for
navigation tool design.

W-SEQUENCES 

According to Agrawal and Srikant (1995), a sequence
is defined as follows: Let I = {i_1, i_2, i_3, ...}
be a set of distinct attributes called items. An
itemset is a nonempty set of unordered items. Items 
within the same itemset are assumed to occur at 
the same time. A sequence is an ordered list of 
itemsets. The goal of sequence mining is to find
those frequent sequences that occur more often than a
specified threshold (Soliman, 2004). Applications
of sequence mining have been found in a great 
number of fields including the health care industry, 
financial industry, telecommunication industry, 
and bioinformatics (Chan, Fan, Prodromidis, & 
Stolfo, 1999; Soliman, 2004; Stolfo, Lee, Chan, 
Fan, & Eskin, 2001). The FWS methodology 
extends the application of sequence mining to 
the domain of VEs. 

The components that make up a wayfinding- 
sequence (W-sequence) are derived from the urban 
design elements that have been popular in the 
design of VEs (Charitos, 1997; Darken & Sibert, 
1996; Ingram & Benford, 1995; Ingram & Benford,
1996; Ingram, Benford, & Bowers, 1996; Lynch,
1960; Modjeska & Waterworth, 2000; Steck & 
Mallot, 2000; Vinson, 1999). Specifically, the 
components of a W-sequence are derived from 
these elements: 
1. Landmarks: A landmark is a distinguishable object in
the VE. In the FWS model, the symbol
representing a specific landmark is denoted
uniquely by L_i, where i is an integer from
1 to nl, and nl is the number of landmarks
within the VE. For example, L_1 and L_2 refer
to the two landmarks (i.e., buildings) in the
map pictured in Figure 2.

2. Paths: A path is a channel of movement, such
as a walkway or a street. In the FWS model,
the symbol representing a specific path is
denoted uniquely by P_i, where i is an integer
from 1 to np, and np is the number of paths
within the VE. For example, P_1, P_2, P_3, and
P_4 refer to the four paths (i.e., streets) in the
map pictured in Figure 2.

3. Nodes: A node is an intersection of two or
more paths. In the FWS model, the symbol
representing a specific node is denoted
uniquely by N_i, where i is an integer from 1
to nn, and nn is the number of nodes within
the VE. For example, N_1, N_2, N_3, and N_4 refer
to the four intersections in the map pictured
in Figure 2.

Figure 2. The three components of a W-sequence:
landmarks, paths, and nodes

Edges, which are defined as boundaries (e.g., 
walls), are an important element in designing 
VEs; however they are not a component of a 
W-sequence. They primarily serve to restrict 
the area of movement of the user within the VE. 
Districts are sections of an environment contain- 
ing the other 4 components. The use of districts 
will be discussed later to show how they improve 
the discovery process of the model of frequent 
W-sequences. 

Formally, we define a W-sequence as a serial
sequence of n symbols (n ≥ 3) representing landmarks,
paths, and nodes occurring at consecutive
time intervals and specifying a route in a VE as
follows:

WS = <X_1, X_2, ..., X_n>    (1)

where:

X_1 and X_n are symbols representing landmarks,
and

X_2 to X_{n-1} are symbols representing landmarks,
paths, or nodes following the rules of a valid
W-sequence.

The following three rules apply for forming a 
valid W-sequence (see Table 1 for examples): 

1. The beginning symbol and the ending sym- 
bol of a W-sequence must be a landmark. 

2. A W-sequence cannot have two consecutive 
symbols in it representing the same type of 
element. 

3. The symbol before and after a landmark or
a node always represents a path.
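
The three rules can be checked mechanically. The following is a minimal sketch, assuming that a W-sequence is stored as a list of strings whose first letter gives the element type ('L', 'P', or 'N'); this encoding and the function names are illustrative assumptions, not part of the FWS methodology as published.

```python
def kind(symbol):
    """Element type of a symbol such as 'L1', 'P3', or 'N2' (its first letter)."""
    return symbol[0]

def is_valid_w_sequence(seq):
    """Check the three rules for a list of symbols, e.g. ['L1', 'P3', 'N2', 'P2', 'L2']."""
    # The shortest route is landmark, path, landmark.
    if len(seq) < 3:
        return False
    # Rule 1: the first and the last symbol are landmarks.
    if kind(seq[0]) != "L" or kind(seq[-1]) != "L":
        return False
    for prev, cur in zip(seq, seq[1:]):
        # Rule 2: no two consecutive symbols of the same element type.
        if kind(prev) == kind(cur):
            return False
        # Rule 3: every neighbour of a landmark or a node is a path.
        if kind(prev) in ("L", "N") and kind(cur) != "P":
            return False
        if kind(cur) in ("L", "N") and kind(prev) != "P":
            return False
    return True
```

For instance, `is_valid_w_sequence(["L1", "P3", "N2", "P2", "L2"])` is accepted, whereas a sequence in which a node directly follows a landmark is rejected by rule 3.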

Table 1. Examples of valid and invalid W-sequences for the map in Figure 2

W-sequence                  Valid   Invalid   Explanation
[illegible]                         X         Violates rule 1: P_3 begins the sequence, but it does not represent a landmark.
[illegible]                         X         Violates rule 2: P_3 and the symbol that follows it are consecutive symbols representing the same type of element.
<L_1 N_2 P_2 L_2>                   X         Violates rule 3: N_2 following L_1 is not allowed.
<L_1 P_3 N_2 P_2 L_2>       X                 Valid W-sequence.

A W-sequence may contain other W-sequences
within it. For example, the W-sequence WS_1 =
<L_1 P_1 N_1 P_2 L_2 P_3 L_3> contains the following
sub-W-sequences:

1. WS_1 = <L_1 P_1 N_1 P_2 L_2 P_3 L_3>,

2. WS_2 = <L_1 P_1 N_1 P_2 L_2>, and

3. WS_3 = <L_2 P_3 L_3>

We define the relation contained-in, denoted
by the symbol ⊆, to relate a W-sequence and its
subsequences. From this example, WS_1 ⊆ WS_1,
WS_2 ⊆ WS_1, and WS_3 ⊆ WS_1. Also, we define the
operation of concatenation, denoted by the symbol
+, to join two W-sequences together (eliminating
only the repeating landmark at the connection
point) to form another W-sequence. From the
example above, WS_1 = WS_2 + WS_3.
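
Both operations are easy to express on a list encoding of W-sequences. The sketch below assumes that contained-in means occurrence as a contiguous run of symbols, which matches the examples given but is not stated explicitly; the function names are hypothetical.

```python
def contained_in(sub, seq):
    """True if `sub` occurs as a contiguous run of symbols inside `seq`."""
    n, m = len(seq), len(sub)
    return any(seq[k:k + m] == sub for k in range(n - m + 1))

def concatenate(first, second):
    """Join two W-sequences, keeping the shared landmark only once."""
    if first[-1] != second[0]:
        raise ValueError("sequences do not meet at a common landmark")
    return first + second[1:]

# The three W-sequences of the example, as lists of symbols.
WS1 = ["L1", "P1", "N1", "P2", "L2", "P3", "L3"]
WS2 = ["L1", "P1", "N1", "P2", "L2"]
WS3 = ["L2", "P3", "L3"]
joined = concatenate(WS2, WS3)
```

Here `joined` equals `WS1`, and `contained_in` holds for each of the three sequences with respect to `WS1`.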

Finally, it is important to note that a valid W-sequence
corresponds to a route from one landmark
to another landmark. For example, WS_1 = <L_1 P_1
N_1 P_2 L_2 P_3 L_3> corresponds to the route “Start at
landmark L_1, travel on path P_1, at node N_1 switch
to path P_2, travel on path P_2, reach landmark L_2,
travel on path P_3, end at landmark L_3.” The rules
of a valid W-sequence have been specifically formulated
in order to guarantee unique translation
of a W-sequence to valid route directions.
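
The translation from a valid W-sequence to route directions can be sketched as follows. The phrasing mirrors the example route above; the list encoding and the function name are assumptions made for illustration.

```python
def to_directions(seq):
    """Turn a valid W-sequence (list of symbols) into a route description."""
    parts = [f"Start at landmark {seq[0]}"]
    for i in range(1, len(seq)):
        sym = seq[i]
        if sym[0] == "P":
            parts.append(f"travel on path {sym}")
        elif sym[0] == "N":
            # In a valid W-sequence a node is always followed by a path.
            parts.append(f"at node {sym} switch to path {seq[i + 1]}")
        elif i == len(seq) - 1:
            parts.append(f"end at landmark {sym}")
        else:
            parts.append(f"reach landmark {sym}")
    return ", ".join(parts) + "."

route = to_directions(["L1", "P1", "N1", "P2", "L2", "P3", "L3"])
```

Because rule 3 fixes the symbol type on either side of every landmark and node, each symbol maps to exactly one phrase.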

FREQUENT W-SEQUENCE 
METHODOLOGY 

We focus on designing a tool to help users navi- 
gate from one landmark to another via the most 


recommended route. To accomplish this task, we 
propose the frequent wayfinding-sequence (FWS) 
methodology. In this methodology, a modified 
frequent sequence mining algorithm is used to 
mine the sequences representing the navigation 
of previous experienced users, which we refer to 
as W-sequences. The result is a model of frequent 
W-sequences that can be used to derive routing 
rules from one landmark to another. Finally, an
interface is provided so that new users can
access the discovered routing rules.

In order to clarify our discussion and provide
examples, we introduce a simple virtual city. A
2-dimensional map of this city can be found in
Figure 3. As can be seen, the city occupies a 12 ×
12 grid with 4 landmarks, 11 paths, and 8 nodes
(i.e., intersections). Each one of the 144 cells is
either occupied by one of the basic elements (e.g.,
a landmark), is part of an element (e.g., a path), or is
empty, which means that it is an edge (e.g., a wall).
For simplicity and without loss of generality,
we assume that the third dimension
of all cells is zero within this virtual city. Thus,
all coordinates corresponding to the cells are of
the form (x, y). However, the FWS methodology
would operate exactly the same even if the third
dimension were not zero.

Formulating W-Sequences 

Most VEs record the movement of the users as a 
series of coordinates sampled at a specified rate. 
Pattern-matching techniques relating recorded 
Figure 3. A 2-D map of a simple virtual city



coordinates to W-sequence components need to 
be used in order to formulate W-sequences. The 
VEs that we are developing to support the FWS 
methodology are designed with an underlying 
3-dimensional grid made up of a series of cells 
(Sadeghian et al., 2005, 2006a). Each cell is 
uniquely identified by a 3-dimensional coordinate
and is associated with one of the basic design 
elements. A database is available relating each 
element to its cell coordinate(s). When a user is 
navigating within one of the VEs, at any one point 
in time he/she is located in one of the cells. The 
fundamental move that can be made is to move 
to another neighboring cell. Such a move can be 
made if the neighboring cell is associated with any 
element except an edge. Thus, the movements of 
a user can be recorded as a set of coordinates. 

The first step in translating the set of coor- 
dinates into a W-sequence is to simply match 
each recorded coordinate with the symbol for 
the corresponding element using the coordinate 
database. Next, some preprocessing is necessary 


to transform the resulting string of symbols into 
a valid W-sequence. For example, any symbol 
(e.g., P) preceding the first and following the 
last landmark needs to be eliminated. Also, 
consecutive symbols that are identical need to be 
condensed into just one occurrence of the symbol. 
Table 2 shows the process of translating a set of 
cell coordinates representing the navigation of 
a user in the virtual city in Figure 3 into a valid 
W-sequence. 
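
The translation steps just described can be sketched in a few lines. The coordinate database is modelled here as a plain dictionary from cell coordinates to element symbols; the tiny corridor used as input is hypothetical and is not the virtual city of Figure 3.

```python
def formulate(coords, cell_db):
    """Translate recorded cell coordinates into a W-sequence (list of symbols)."""
    # Step 1: look up the element symbol for every recorded coordinate.
    symbols = [cell_db[c] for c in coords]
    # Step 2: drop everything before the first and after the last landmark.
    marks = [k for k, s in enumerate(symbols) if s.startswith("L")]
    if not marks:
        return []
    symbols = symbols[marks[0]:marks[-1] + 1]
    # Step 3: condense runs of identical consecutive symbols.
    out = []
    for s in symbols:
        if not out or out[-1] != s:
            out.append(s)
    return out

# Hypothetical coordinate database for a one-row corridor.
cell_db = {(0, 0): "P1", (1, 0): "L1", (2, 0): "P2", (3, 0): "P2",
           (4, 0): "N1", (5, 0): "P3", (6, 0): "L2", (7, 0): "P4"}
coords = [(x, 0) for x in range(8)]
w_sequence = formulate(coords, cell_db)
```

The leading `P1` and the trailing `P4` are removed because they lie outside the first and last landmark, and the two `P2` cells collapse into one symbol.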

Table 2. Translation of a set of coordinates into a valid W-sequence

Action: Recording of coordinates
Result: (10,5), (10,4), (10,3), (11,3), (12,3), (12,4), (12,5), (12,6), (12,7), (12,8), (12,9),
(11,9), (10,9), (10,8), (10,7), (10,6), (10,5), (10,4), (10,3), (9,3), (8,3), (7,3),
(6,3), (6,2), (6,1), (5,1), (4,1), (3,1), (2,1), (2,2), (2,3), (2,4), (1,4), (1,5), (1,6),
(1,7), (1,8), (1,9), (1,10), (2,10), (3,10)

Action: Translation of coordinates into string of symbols
Result: <P_8 P_8 L_3 P_9 P_9 P_9 P_9 P_9 P_9 P_9 P_9 P_9 L_4 P_8 P_8 N_7 P_8 P_8 L_3 P_6 P_6 N_5 P_6 P_6 N_1 P_7 P_7 P_7 P_7
L_1 P_1 P_1 P_1 P_1 P_1 P_1 P_1 P_1 P_1 L_2 P_?>

Action: Elimination of symbols before the first and after the last landmark symbol
Result: <L_3 P_9 P_9 P_9 P_9 P_9 P_9 P_9 P_9 P_9 L_4 P_8 P_8 N_7 P_8 P_8 L_3 P_6 P_6 N_5 P_6 P_6 N_1 P_7 P_7 P_7 P_7 L_1
P_1 P_1 P_1 P_1 P_1 P_1 P_1 P_1 P_1 L_2>

Action: Elimination of identical consecutive symbols
Result: <L_3 P_9 L_4 P_8 N_7 P_8 L_3 P_6 N_5 P_6 N_1 P_7 L_1 P_1 L_2>

Pruning the W-Sequences

The next step of the FWS methodology involves
pruning the W-sequences to eliminate “noisy”
data. One form of pruning that takes place is the
elimination of redundant symbols. This means
condensing a string of symbols of the form P_i N_j
P_i to simply P_i. No “information” is lost by this
modification; rather, the W-sequence that goes
through this type of pruning is more conducive
to direction generation. For example, P_i N_j P_i
would translate to “travel on path P_i, at node
N_j continue to travel on path P_i.” This would be
more simply stated as “travel on path P_i.” Since
landmarks are crucial to wayfinding, they are
never eliminated.

A loop is defined as a W-sequence where the 
starting landmark symbol and the ending land- 
mark symbol are the same. In terms of navigation, 
this corresponds to starting at a specific landmark 
and, through a series of movements, ending back 
up at the same landmark. In a VE, this often 
translates to backtracking and disorientation in 
the space (Conroy, 2001; Darken & Sibert, 1996). 
The FWS methodology finds these loops, registers
them in a database, and eliminates them from the 
W-sequences before the mining process. Find- 
ing and registering loops is, in itself, a form of 
knowledge extraction (Sadeghian et al., 2006b). 
Information about frequent loops discovered can 
later be made available to users of the system to 
help them avoid the mistakes of previous users. 
Information on loops is also beneficial to the de- 
signers of the system, allowing them to improve 
the design of their system. 
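One plausible way to find, register, and eliminate loops is sketched below in Python. The chapter does not specify the procedure, so the single left-to-right scan, the `Counter` used as the loop registry, and the name `remove_loops` are all assumptions made for illustration.

```python
from collections import Counter


def remove_loops(symbols, registry):
    """Cut out every loop (landmark ... same landmark) and count it in registry."""
    out = []
    for s in symbols:
        if s.startswith("L") and s in out:
            start = out.index(s)
            registry[tuple(out[start:] + [s])] += 1  # register the loop
            del out[start + 1:]                      # keep the landmark once
        else:
            out.append(s)
    return out


loops = Counter()
w = ["L3", "P9", "L4", "P8", "L3", "P6", "N1", "P7", "L1", "P1", "L2"]
print(remove_loops(w, loops))
print(loops)
```

Because the registry accumulates counts across calls, running the function over a whole database also yields the loop frequencies that can later be reported to users and designers.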

Table 3 shows the result of pruning the W-sequence in Table 2. First, the redundant nodes and paths (i.e., $P_8$, $N_7$, $P_6$, $N_5$) are eliminated. Then the loop (i.e., <L_3 P_9 L_4 P_8 L_3>) is registered and eliminated from the W-sequence. The resulting W-sequence corresponds to the route “Start at landmark $L_3$, travel on path $P_6$, at node $N_1$ switch to path $P_7$, travel on path $P_7$, arrive at landmark $L_1$, travel on path $P_1$, end at landmark $L_2$.”

Mining for W-Sequences 

A sequence-mining algorithm is executed to find 
all the W-sequences within the preprocessed 
W-sequence database (Sadeghian et al., 2006a). 
While examining a W-sequence, all W-sequences 
contained within it (i.e., sub-W-sequences) are 
noted and counted. We assume that the paths
within a VE allow for two-way travel. Therefore,
a W-sequence and its reverse represent the
same unique W-sequence. For example, <L_i P_k L_j>
is the same as <L_j P_k L_i>.
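The counting step can be sketched as follows. This is a minimal illustration, not the mining algorithm of Sadeghian et al. (2006a): it assumes that a sub-W-sequence is any contiguous slice that starts and ends at a landmark, that landmark symbols have the form "L" followed by a number, and that loops have already been removed. The names `canonical` and `count_sub_sequences` are hypothetical.

```python
from collections import Counter


def canonical(seq):
    """Orient a W-sequence so that the lower-numbered landmark comes first."""
    seq = tuple(seq)
    return seq if int(seq[0][1:]) < int(seq[-1][1:]) else seq[::-1]


def count_sub_sequences(database):
    """Count every landmark-to-landmark slice, treating reverses as identical."""
    counts = Counter()
    for w in database:
        marks = [i for i, s in enumerate(w) if s.startswith("L")]
        for a in range(len(marks)):
            for b in range(a + 1, len(marks)):
                sub = w[marks[a]: marks[b] + 1]
                if sub[0] != sub[-1]:  # skip any loop that was left in
                    counts[canonical(sub)] += 1
    return counts


db = [["L1", "P7", "N1", "P6", "L3", "P9", "L4"],
      ["L4", "P9", "L3"]]
for seq, n in count_sub_sequences(db).items():
    print(" ".join(seq), n)
```

Storing each sequence in one canonical direction is what lets the second record, traveled from L4 to L3, add to the count of the L3 to L4 sequence found in the first record.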

Table 3. Pruning the W-sequence in Table 2

| Action | Result |
| --- | --- |
| Elimination of redundant symbols | <L_3 P_9 L_4 P_8 L_3 P_6 N_1 P_7 L_1 P_1 L_2> |
| Elimination of the loop | <L_3 P_6 N_1 P_7 L_1 P_1 L_2> |

The end result of the mining process is a data structure with entries for landmark pairs of the form $L_i - L_j$, where $i \ne j$, $1 \le i \le nl - 1$, and $i < j \le nl$. Each $L_i - L_j$ entry contains all the W-sequences starting with the symbol $L_i$ and ending with the symbol $L_j$ mined from the W-sequence database, together with their corresponding count of occurrence and support. The support for a W-sequence starting with $L_i$ and ending with $L_j$ is defined as its count of occurrence divided by the total number of occurrences of all W-sequences for the $L_i - L_j$ pair, expressed as a percentage. Table 4 provides an example of such a data structure found after mining a randomly generated W-sequence database for the VE in Figure 3. Note that for some $L_i - L_j$ pairs, more than one unique W-sequence is found. Each W-sequence is followed by its count of occurrence, its support, and the corresponding lower limit of the binomial confidence interval with $\alpha = .05$.
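Support and the confidence measure can be computed as in the sketch below. The chapter does not name the type of binomial interval; the exact (Clopper-Pearson) interval is assumed here because its lower limit for an all-successes sample, $(\alpha/2)^{1/n}$, gives .1581 for 2 of 2 and .4782 for 5 of 5, which agrees with the corresponding rows of Table 4. The function names are hypothetical, and the limit is found by bisection so that only the standard library is needed.

```python
from math import comb


def support(count, total):
    """Count of occurrence as a percentage of all W-sequences for the pair."""
    return 100.0 * count / total


def upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))


def lower_limit(x, n, alpha=0.05):
    """Lower limit of the exact two-sided (1 - alpha) binomial interval."""
    if x == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: upper_tail increases with p
        mid = (lo + hi) / 2
        if upper_tail(x, n, mid) < alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(support(11, 60), 2))
print(round(lower_limit(2, 2), 4), round(lower_limit(5, 5), 4))
```

Using the lower limit rather than the raw support penalizes sequences that were seen only a few times: a sequence observed 2 times out of 2 has 100% support but a lower limit of only about .16.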

Discovering the Model of Frequent 
W-Sequences 

A frequent W-sequence for the pair $L_i - L_j$ meets the following two conditions:

1. It has a statistical confidence above a predefined threshold value.

2. It has the highest statistical confidence among all W-sequences for the pair $L_i - L_j$.

In our experiments, we used the lower limit of the binomial confidence interval as our measure of statistical confidence. Since it is possible that for some $L_i - L_j$ pairs no W-sequence exceeds the predefined threshold, a frequent W-sequence is not guaranteed to exist for every possible $L_i - L_j$ pair.

If a frequent W-sequence is not found for a pair $L_i - L_j$, it may be possible to combine two other frequent W-sequences to derive one. This can be done by searching for a landmark $L_k$, where $k$ is an integer and $1 \le k \le nl$, such that there exist frequent W-sequences for the pairs $L_i - L_k$ and $L_k - L_j$. The frequent W-sequence for the pair $L_i - L_j$ is derived from the following concatenation process:

$$(L_i - L_j)_{FWS} = (L_i - L_k)_{FWS} + (L_k - L_j)_{FWS} \tag{2}$$

If more than one such $L_k$ is available, the $L_k$ for which the smaller of the confidences of $(L_i - L_k)_{FWS}$ and $(L_k - L_j)_{FWS}$ is highest should be selected. The proposed logic could be extended to consider combining more than two frequent W-sequences at a time using dynamic programming techniques, but the problem quickly becomes computationally expensive (Goodrich & Tamassia, 2001). Figure 4 summarizes the process described for finding the frequent W-sequence for a pair $L_i - L_j$.


Table 4. W-sequences mined from a hypothetical database for the VE in Figure 3

| L_i - L_j | W-sequences | Count | Support | Confidence |
| --- | --- | --- | --- | --- |
| L_1 - L_2 | <L_1 P_1 L_2> | 11 | 18.33% | .0952 |
| | <L_1 P_2 L_2> | 20 | 33.33% | .2169 |
| | <L_1 P_2 N_2 P_4 N_4 P_3 N_3 P_2 L_2> | 29 | 48.33% | .3523 |
| L_1 - L_3 | <L_1 P_7 L_3> | 10 | 25.00% | .1269 |
| | <L_1 P_7 N_1 P_6 L_3> | 30 | 75.00% | .5880 |
| L_1 - L_4 | <∅> | 0 | 0% | 0 |
| L_2 - L_3 | <L_2 P_2 N_3 P_3 N_4 P_4 N_6 P_5 N_5 P_6 L_3> | 2 | 100% | .1581 |
| L_2 - L_4 | <L_2 P_11 L_4> | 5 | 100% | .4782 |
| L_3 - L_4 | <L_3 P_8 L_4> | 11 | 29.73% | .1587 |
| | <L_3 P_9 L_4> | 26 | 70.27% | .5302 |

Figure 4. Deriving the frequent W-sequence for a pair L_i - L_j

    Given: A set S of n unique W-sequences for the pair L_i - L_j
           such that S = {(L_i - L_j)_1, (L_i - L_j)_2, ..., (L_i - L_j)_n}

    Find: (L_i - L_j)_t ∈ S such that
          Confidence((L_i - L_j)_t) > threshold value &
          Confidence((L_i - L_j)_t) > Confidence((L_i - L_j)_m)
          ∀ t, m ∈ [1, ..., n], m ≠ t

    if ((L_i - L_j)_t ≠ <∅>) {
        (L_i - L_j)_FWS = (L_i - L_j)_t
        return ((L_i - L_j)_FWS)
    } else {
        Find: (L_i - L_k)_FWS ≠ <∅> and (L_k - L_j)_FWS ≠ <∅> such that
              Min(Confidence((L_i - L_k)_FWS), Confidence((L_k - L_j)_FWS)) >
              Min(Confidence((L_i - L_m)_FWS), Confidence((L_m - L_j)_FWS))
              ∀ k, m ∈ [1, ..., nl], m ≠ k
        try {
            (L_i - L_j)_FWS = (L_i - L_k)_FWS + (L_k - L_j)_FWS
            return ((L_i - L_j)_FWS)
        } catch {
            return (<∅>)
        }
    }


Table 5. The FWS model

| L_i - L_j | FWS | Confidence |
| --- | --- | --- |
| L_1 - L_2 | <L_1 P_2 N_2 P_4 N_4 P_3 N_3 P_2 L_2> | .3523 |
| L_1 - L_3 | <L_1 P_7 N_1 P_6 L_3> | .5880 |
| L_1 - L_4 | <L_1 P_7 N_1 P_6 L_3 P_9 L_4> | .5302 |
| L_2 - L_3 | <L_2 P_11 L_4 P_9 L_3> | .4782 |
| L_2 - L_4 | <L_2 P_11 L_4> | .4782 |
| L_3 - L_4 | <L_3 P_9 L_4> | .5302 |

Table 5 shows the FWS model when the threshold value is set to 0.25 for the W-sequences found in Table 4. For $L_i - L_j$ pairs where no frequent W-sequence was found (i.e., $L_1 - L_4$ and $L_2 - L_3$), other frequent W-sequences are concatenated to derive a frequent W-sequence for the particular pair. The assigned confidence is the lower confidence of the two frequent W-sequences that were concatenated. The FWS model represents the navigation behaviors of the experienced VE users. The recommended routes of travel are developed by translating the frequent W-sequences into informative directions.
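The selection and concatenation logic of Figure 4 can be sketched in Python as below. Several choices are assumptions made for illustration: the mined results are held in a dictionary keyed by landmark pair, a sequence qualifies when its confidence is greater than or equal to the threshold, only directly mined frequent W-sequences are used as halves of a concatenation, and a stored sequence is reversed when the opposite direction of travel is needed. The function names are hypothetical. The data are the rows of Table 4 that meet or could be confused with the threshold; the $L_2 - L_3$ entry is abbreviated to its end landmarks because only its confidence matters here.

```python
def best_direct(candidates, threshold):
    """Highest-confidence mined W-sequence meeting the threshold, else None."""
    ok = [c for c in candidates if c[1] >= threshold]
    return max(ok, key=lambda c: c[1]) if ok else None


def oriented(model, a, b):
    """Entry for the pair a-b, reversed if stored as b-a; None if absent."""
    if (a, b) in model:
        return model[(a, b)]
    if (b, a) in model:
        seq, conf = model[(b, a)]
        return seq[::-1], conf
    return None


def build_fws_model(mined, landmarks, threshold):
    pairs = [(a, b) for i, a in enumerate(landmarks) for b in landmarks[i + 1:]]
    direct = {}
    for pair in pairs:
        hit = best_direct(mined.get(pair, []), threshold)
        if hit:
            direct[pair] = hit
    model = dict(direct)
    for a, b in pairs:
        if (a, b) in direct:
            continue
        best = None
        for k in landmarks:
            if k in (a, b):
                continue
            left, right = oriented(direct, a, k), oriented(direct, k, b)
            if left and right:
                conf = min(left[1], right[1])  # weaker of the two halves
                if best is None or conf > best[1]:
                    best = (left[0] + right[0][1:], conf)
        if best:
            model[(a, b)] = best
    return model


mined = {
    ("L1", "L2"): [(("L1", "P1", "L2"), 0.0952),
                   (("L1", "P2", "L2"), 0.2169),
                   (("L1", "P2", "N2", "P4", "N4", "P3", "N3", "P2", "L2"), 0.3523)],
    ("L1", "L3"): [(("L1", "P7", "L3"), 0.1269),
                   (("L1", "P7", "N1", "P6", "L3"), 0.5880)],
    ("L2", "L3"): [(("L2", "L3"), 0.1581)],  # abbreviated; below the threshold
    ("L2", "L4"): [(("L2", "P11", "L4"), 0.4782)],
    ("L3", "L4"): [(("L3", "P8", "L4"), 0.1587),
                   (("L3", "P9", "L4"), 0.5302)],
}
model = build_fws_model(mined, ["L1", "L2", "L3", "L4"], 0.25)
for pair, (seq, conf) in sorted(model.items()):
    print(pair, " ".join(seq), conf)
```

For the pair $L_1 - L_4$, going through $L_3$ gives a weaker-half confidence of .5302, whereas going through $L_2$ gives only .3523, so $L_3$ is chosen, as in Table 5.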

Districting 

The concept of districts can be used to improve the discovery process of the model of frequent W-sequences (Sadeghian et al., 2006a). Given a VE, the $nl$ landmarks are divided into $nd$ districts, where $nd$ is an integer such that $nd \ge 1$. A specific district is denoted by the symbol $D_i$, such that $1 \le i \le nd$. Each district $D_i$ contains $nl_{D_i}$ landmarks such that $2 \le nl_{D_i} \le nl$, and $D_i \cap D_j = \emptyset \;\; \forall\, i, j \in [1, \ldots, nd]$, $i \ne j$. The process of assigning a landmark to a district involves consideration of the spatial distribution of the landmarks and experimentation with the VE. Our experiments showed that the proximity of the landmarks and the number of paths and nodes between landmarks are important factors to consider.

After deriving the districts, the FWS methodology operates by finding the FWS model for each district. In addition, special frequent W-sequences, termed C-sequences (connection-sequences), are introduced. A C-sequence for the pair $D_i - D_j$ ($D_i \ne D_j$) is defined as a frequent W-sequence starting with $L_u$, $L_u \in D_i$, and ending with $L_v$, $L_v \in D_j$, such that its confidence is higher than that of any other such frequent W-sequence between the selected districts. C-sequences allow for the derivation of frequent W-sequences where the landmarks represented by the beginning and ending symbols are in different districts. For example, given a C-sequence for the pair of districts $D_i - D_j$, say $(L_u - L_v)_{C\text{-}Sequence}$, and an FWS model for both districts $D_i$ and $D_j$, a frequent W-sequence for the pair $L_s - L_t$, where $L_s \in D_i$ and $L_t \in D_j$, is derived by the following concatenation process:

$$(L_s - L_t)_{FWS} = (L_s - L_u)_{FWS} + (L_u - L_v)_{C\text{-}Sequence} + (L_v - L_t)_{FWS} \tag{3}$$
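The three-part concatenation can be illustrated with a small Python sketch. Everything in it is hypothetical: the landmark and path names, the dictionaries `fws_i` and `fws_j` standing in for the two district models, and the function names. It also assumes that each district model already stores the needed sequence in the direction of travel.

```python
def join(*segments):
    """Concatenate W-sequences that meet at a shared junction landmark."""
    route = list(segments[0])
    for seg in segments[1:]:
        if route[-1] != seg[0]:
            raise ValueError("segments do not meet at a common landmark")
        route.extend(seg[1:])
    return route


def cross_district(fws_i, c_seq, fws_j, s, t):
    """Route from s (district i) to t (district j) through the C-sequence."""
    u, v = c_seq[0], c_seq[-1]
    head = [s] if s == u else fws_i[(s, u)]  # already at the connection point
    tail = [t] if t == v else fws_j[(v, t)]
    return join(head, c_seq, tail)


# Hypothetical two-district example.
fws_i = {("L1", "L2"): ["L1", "P1", "L2"]}
fws_j = {("L6", "L7"): ["L6", "P8", "L7"]}
c_seq = ["L2", "P5", "L6"]
print(cross_district(fws_i, c_seq, fws_j, "L1", "L7"))
```

When the start or destination is itself an end of the C-sequence, the corresponding district segment collapses to a single landmark and only the remaining parts are joined.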

Figure 5. Interface of a navigation tool designed using the FWS methodology

Districting has the advantage of requiring the discovery of a smaller number of frequent W-sequences to build a complete FWS model for the entire VE. The minimum number of C-sequences that need to be found in a VE divided into $nd$ linearly connected districts, in order to allow for travel between districts, is defined as follows:

$$\text{Minimum number of C-sequences} = nd - 1 \tag{4}$$

Equation (5) shows the total number of frequent W-sequences (TFW) that need to be discovered for the entire VE, assuming that the minimum number of C-sequences for its $nd$ districts has already been found. Note that $nd = 1$ means that no districting is involved. Equation (6) defines the percentage of the frequent W-sequence (PFW) model discovered for an environment with $nd$ districts.

$$TFW_{nd} = \sum_{i=1}^{nd} {}^{nl_{D_i}}C_2 \tag{5}$$

$$PFW_{nd} = \left( nf / TFW_{nd} \right) \times 100\% \tag{6}$$

where:

$${}^{nl_{D_i}}C_2 = \frac{(nl_{D_i})!}{(nl_{D_i} - 2)! \cdot 2!}$$

$nf$ is the total number of frequent W-sequences registered in the models
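Equations (4) to (6) are simple to evaluate, as the following Python sketch shows. The function names are hypothetical, and a district layout is assumed to be given as a list of landmark counts, one per district.

```python
from math import comb


def min_c_sequences(nd):
    """Equation (4): C-sequences needed for nd linearly connected districts."""
    return nd - 1


def tfw(district_sizes):
    """Equation (5): landmark pairs summed over the districts."""
    return sum(comb(n, 2) for n in district_sizes)


def pfw(nf, district_sizes):
    """Equation (6): percentage of the FWS model discovered."""
    return 100.0 * nf / tfw(district_sizes)


print(tfw([50]))         # one district of 50 landmarks
print(tfw([10] * 5))     # five districts of 10 landmarks each
print(min_c_sequences(5))
```

With a single district of 50 landmarks, 1,225 frequent W-sequences must be found; splitting the same landmarks into five districts of 10 reduces this to 225 within districts plus 4 C-sequences, which is the saving that districting relies on.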

Recommendation of a Route 

The final step is to provide an interface so that 
the VE users can benefit from the knowledge 
extracted through the FWS methodology. Figure 
5 shows an example of such an interface. 

The user is provided with a traditional 2-D map 
of the environment, as well as the choice to pick 
the desired destination (i.e., ending landmark). 
When the choice is submitted, a recommended 
route corresponding to the frequent W-sequence 
for the landmark closest to the current location 
and the destination landmark is displayed. The 
interface also warns the user of common naviga- 
tion mistakes that may be made while traveling 
between the chosen landmarks. These common 
navigation mistakes correspond to the loops that 
were registered earlier. In summary, this naviga- 


tion tool helps the user choose a route of travel 
from the many available possibilities. 

EXPERIMENTS AND DISCUSSION 

In our previous papers, we discussed a real-world 
experiment designed to explore the usefulness 
and the applicability of the FWS methodology 
(Sadeghian et al., 2005, 2006a). Our results showed
that users of a VE equipped with an FWS-based 
tool performed better at navigation tasks than 
users of a similar VE equipped with a traditional 
2-D map. Furthermore, the use of the FWS-based 
tool also led to the improvement in the quality 
of the human-computer interaction with the VE 
(Sadeghian et al., 2005, 2006a). 

In this chapter, we will discuss several simu- 
lation experiments of large VEs. The purpose 
of our simulation experiments was to study the 
scalability of the FWS methodology. In particular, 
we wanted to study how the size of the navigation 
data available, the complexity of the environment 
(i.e., number of landmarks, paths, nodes), and the 
concept of districting would influence the FWS 
process. We developed a Java simulation program 
that takes, as input, the following parameters: the 
number of landmarks, the number of paths, the 
number of nodes, and the number of navigation 
records to be simulated. The simulation program 
generates a model of a VE with the desired number 
of elements, assigns weights to the paths based 
on a normal distribution, and outputs the desired 
number of navigation records of travel through 
the VE. We ran the simulation to produce four
different VEs with increasing $TFW_1$ values (see
Equation (5)). Table 6 gives a description of the
four simulated large VEs.

Loop Analysis 

Our first focus was to observe the loop registration process for the different VEs. For each of the VEs, databases of navigation records of 13 different sizes were produced. Each navigation record corresponded to the navigation of a “user” visiting five landmarks in the given VE. The “user” is free to start at any landmark and move in any direction, although movements through certain paths are preferred over others due to the weights randomly assigned to the paths. In this experiment, we defined a loop as significant if its count of occurrence is five or more. A nonsignificant loop occurs at least once, but fewer than five times, in the database. Figure 6 shows the number of significant loops discovered in each database for the four VEs, while Figure 7 shows the number of nonsignificant loops discovered in each database for the four VEs.

Table 6. Description of the four simulated large VEs

| ID | nl | np | nn | TFW_1 |
| --- | --- | --- | --- | --- |
| Environment A | 10 | 30 | 10 | 45 |
| Environment B | 50 | 150 | 50 | 1225 |
| Environment C | 100 | 300 | 100 | 4950 |
| Environment D | 500 | 1500 | 500 | 124750 |

The number of both nonsignificant and significant loops increases as the size of the database increases for a given VE. This is because the more data that is available, the higher the probability of extracting loops. An interesting observation is that the more complex the environment, the more records are required before the initial discovery of significant loops. This is because the bigger the space, the more opportunity there is for making different types of loops and, therefore, the less likely it is that the same loop will be repeated enough times to become significant. However, once a sufficient number of records becomes available, the number of significant loops extracted from a complex environment becomes much greater than that for a less complex environment, given the same size database.

Figure 6. Significant loops for the VEs

Figure 7. Nonsignificant loops for the VEs

FWS Model Discovery 

In our next set of experiments, the navigation 
records contained in different size databases for 
the four VEs were mined for W-sequences. In 
the experiments, we set the threshold value of
the confidence measure (the lower limit of the
binomial confidence interval) to 0.25, with $\alpha = .05$ (95%
confidence). A frequent W-sequence for a pair $L_i - L_j$
is only registered in the FWS model if

1. Its confidence is equal to or greater than 0.25 and it has the highest statistical confidence among all W-sequences for the pair $L_i - L_j$, or if

2. It is derived from concatenating two previously registered frequent W-sequences.


Our focus with this set of simulation ex- 
periments is in determining the percentage of 
the model of frequent W-sequences that could 
be discovered and how districting could be used 
to improve the discovery process. 

Each navigation record for Environment A 
consisted of up to five landmarks before the loop 
elimination process. Figure 8 shows that approxi- 
mately 100 navigation records are needed to dis- 
cover all 45 frequent W-sequences corresponding 
to 100 percent of the FWS model. Therefore, it 
is not necessary to have more than one district. 
However, this is not the case for more complex 
environments. 

Each navigation record for the remaining three 
VEs consisted of up to 20 landmarks before the 
loop elimination process. For each environment, 
the FWS process was applied for several different 
size W-sequence databases at different values of 
nd (number of districts). A value of nd = 1 means
that no districting is involved. Figure 9, Figure 
10, and Figure 11 show the differences in model 
generation for the different W-sequence databases 
and nd values for the three VEs. Analysis of these 
graphs leads to some general observations. 

For any given environment with any value of $nd$, the more navigation data that is available, the greater the percentage of the model found. This is because an increase in the amount of data leads to an increase in the probability of the same W-sequence frequently occurring in the database. For example, only 0.7% of the model is discovered for Environment B ($nd = 1$) when only 10 navigation records are available, while 46.2% of the model is discovered when 10,000 navigation records are available. Similar results are seen for Environment A, Environment C, and Environment D.

Figure 8. Discovering the FWS model for Environment A

Figure 9. Discovering the FWS model for Environment B (50 landmarks). Series: 1 district (50 landmarks); 2 districts (25 landmarks each) + 1 connection; 5 districts (10 landmarks each) + 4 connections.

Figure 10. Discovering the FWS model for Environment C (100 landmarks). Series: 1 district (100 landmarks); 2 districts (50 landmarks each) + 1 connection; 5 districts (20 landmarks each) + 4 connections; 10 districts (10 landmarks each) + 9 connections. Horizontal axis: number of navigational records (log scale).

Figure 11. Discovering the FWS model for Environment D

Another general observation is that the more complex an environment, the smaller the amount of the model discovered compared to a less complex environment, given the same amount of navigation data and a constant $nd$ value. For example, 10,000 navigation records lead to the discovery of only 3.7% of the model for Environment D ($nl = 500$, $nd = 1$), but to 21.9% of the model for Environment C ($nl = 100$, $nd = 1$), 46.2% of the model for Environment B ($nl = 50$, $nd = 1$), and 100% of the model for Environment A ($nl = 10$, $nd = 1$).

Unlike for Environment A, the entire model of frequent W-sequences cannot be discovered for the other environments when districting is not used ($nd = 1$), regardless of the size of the database. The simulation results show that districting is an effective way to improve the model discovery process. For any given environment, more districts correspond to a greater percentage of the model discovered, given the same size database. For example, given approximately 1,000 navigation records, 100% of the model is discovered when $nd = 5$ for Environment B, as opposed to 37% when $nd = 1$. Similar results are seen for Environment C and Environment D. The more complex the environment, the greater the amount of data and the number of districts needed to discover the entire model, as compared to a less complex environment. For example, approximately 5,000 navigation records and $nd = 10$ are needed to discover the entire model for Environment C, while the entire model is discovered with approximately 1,000 navigation records and $nd = 5$ for Environment B. However, even for very complex environments (e.g., Environment D), the use of districts eventually leads to the discovery of 100 percent of the FWS model.

It is important to note that districting has 
some disadvantages associated with it. Since it is 
necessary to initially find the minimum number 
of C-sequences, a number of navigation records 
would have to be first processed for this purpose. 
This explains “the delay” seen in the graphs 
where more than one district is involved before 
any percentage of the model is found. The more 
districts involved, the greater number of initial 
records needed to find C-sequences, and thus, 
the longer “the delay.” The other disadvantage 
of districting is that as the number of districts 
increases, so does the probability that any two 
landmarks are not in the same district. Therefore, 
more frequent W-sequences need to be derived 
by concatenating other frequent W-sequences 
and C-sequences. This in turn has the potential 
of lowering the quality of the final FWS model, 
since a smaller percentage of the frequent W- 
sequences were explicitly mined. 

Despite these issues, we conclude that district- 
ing is an effective way to improve the discovery 
process of the FWS model. As we have shown, 
the FWS methodology is able to discover the 
necessary frequent W-sequences for even com- 
plex environments through the use of districts. 
Therefore, the FWS methodology is an effective 
approach to extract the spatial knowledge needed 
to design an intelligent navigation assistance 
system for complex VEs. 

CONCLUSION 

Users of large VEs often have problems navigating efficiently. We proposed the FWS methodology to derive a model of the experienced users’ navigation behaviors. The model is used to build a navigation tool that recommends routes to novice users of a VE. Our experiments showed the scalability of the approach and its ability to derive a complete navigation model even for very large and complex VEs.

In the next stage of our research, we plan 
to improve the route recommendation process. 
Specifically, we plan to enhance the interface of 
the FWS-based tool with an interactive image 
preview of the most salient characteristics of a 
recommended route. Furthermore, we plan to 
personalize the route recommendation process. 
Personalization will be realized by the introduc- 
tion of informative user profiles that capture the 
route selection factors deemed as important by the 
user. Routes matching the user’s profile will be 
recommended. The addition of these techniques 
will enhance the FWS methodology’s ability to
further improve the quality of the human-computer
interaction within large VEs.

ABSTRACT

Knowledge discovery is a compute- and data-intensive process that allows for finding patterns, trends, 
and models in large datasets. The grid can be effectively exploited for deploying knowledge discovery 
applications because of the high performance it can offer and its distributed infrastructure. For ef- 
fective use of grids in knowledge discovery, the development of middleware is critical to support data 
management, data transfer, data mining and knowledge representation. To such purpose, we designed 
the Knowledge Grid, a high-level environment providing for grid-based knowledge discovery tools and 
services. Such services allow users to create and manage complex knowledge discovery applications, 
composed as workflows that integrate data sources and data-mining tools provided as distributed grid 
services. This chapter describes the Knowledge Grid architecture and describes how its components 
can be used to design and implement distributed knowledge discovery applications. Then, the chapter 
describes how the Knowledge Grid services can be made accessible using the open grid services archi- 
tecture (OGSA) model. 


INTRODUCTION 

Knowledge discovery in databases (KDD) is often both a compute- and data-intensive process. When large datasets are coupled with geographic distribution of data, users, and systems, a variety of technologies must be combined to implement high-performance distributed knowledge discovery systems. Most current off-the-shelf KDD environments require central aggregation of data that, in many cases, is distributed. Storing the data at a single site may not always be feasible because of limited network bandwidth, security concerns, scalability problems, and other practical issues.

Data mining in large settings such as virtual organization networks, the Internet, corporate intranets, sensor networks, and the emerging world of ubiquitous computing calls into question the suitability of centralized KDD architectures for large-scale knowledge discovery in a networked environment. The field of distributed KDD offers an alternative approach. It works by analyzing data in a distributed fashion, and pays particular attention to the trade-off between centralized collection and distributed analysis of data.

When the datasets are large, scaling up the 
speed of the KDD process is a crucial issue. 
Distributed knowledge discovery techniques 
address this problem by using high-performance 
multicomputer machines and a decentralized ap- 
proach for mining large datasets that can be used 
when several interconnected machines are avail- 
able for running distributed data-mining models. 
The increasing availability of such machines
and networks calls for extensive development
of data-analysis algorithms able to scale with
datasets, measured in terabytes and petabytes, on
distributed and parallel machines with hundreds
or thousands of processors. Knowledge discovery
is sped up by executing, in a distributed way, a
number of data-mining processes on different data
subsets, and then combining the results through
metalearning. This technology is particularly
suitable for applications that typically deal with 
very large amounts of data (e.g., transaction 
data, scientific simulation, and telecommunica- 
tion data) that cannot be analyzed in a single 
site on traditional machines in acceptable times. 
Moreover, parallel data-mining algorithms can 
be a component of distributed data-mining ap- 


plications that can exploit either parallelism or 
data distribution. 

Grid technology integrates both distributed 
and parallel computing; thus, it represents a critical 
infrastructure for high-performance distributed 
knowledge discovery. Grid computing is receiving
increasing attention from the research
community, industry, and governments, which look
to this new computing infrastructure as
a key technology for solving complex problems
and implementing distributed high-performance
applications (Foster, Kesselman, Nick, & Tuecke,
2002). Today a large number and variety
of grid tools and middleware allow the
user community to use grids for implementing a
larger set of applications than was possible 1 or 2
years ago.

The term “grid” defines a global distributed 
computing platform through which — like in a 
power grid — users gain ubiquitous access to a 
range of services, computing, and data resources. 
The driving grid applications are traditional 
high-performance applications, such as high- 
energy particle physics, and astronomy and 
environmental modeling, in which experimental 
devices create large quantities of data that require 
scientific analysis. 

Grid computing differs from conventional 
distributed computing because it focuses on 
large-scale resource sharing, offers innovative ap- 
plications, and, in some cases, it is geared toward 
high-performance systems. Although originally 
intended for advanced science and engineering 
applications, grid computing has emerged as a 
paradigm for coordinated resource sharing and 
problem solving in dynamic, multi-institutional 
virtual organizations in industry and business. 
Therefore, today’s grids can be used as effective 
infrastructures for distributed high-performance 
computing and data processing. Grid applications 
include: 

• Intensive simulations on remote supercomputers.

• Cooperative visualization of very large scientific data sets.

• Distributed processing for computationally demanding data analysis.

• Coupling of scientific instruments with remote computers and data archives.

In the last decade, toolkits and software envi- 
ronments for implementing grid applications have 
become available. These include Condor (http:// 
www.cs.wisc.edu/condor), Legion (http://legion. 
virginia.edu), Unicore (http://www.unicore.org), 
and the Globus Toolkit (http://www.globus.org/ 
toolkit). In particular, the Globus Toolkit is the 
most widely used middleware in scientific and 
data-intensive grid applications, and is the de 
facto standard for implementing grid systems. The 
toolkit addresses security, information/discovery,
resource and data management, communication,
fault detection, and portability issues. Today,
Globus and the other grid tools are used in many 
projects worldwide. Although most of these proj- 
ects are in scientific and technical computing, 
there is a growing number of grid projects in 
education, industry, and commerce. 


Together with the grid shift towards industry 
and business applications, a parallel shift toward 
the implementation of data grids has been reg- 
istered. Data grids are designed to allow large 
datasets to be stored in repositories and moved 
about with the same ease as small files can be 
moved. They represent an enhancement of com- 
putational grids, driven by the need to handle 
large datasets without repeated authentication, 
and aiming at supporting the implementation of 
distributed data-intensive applications. 

The grid can be effectively exploited for imple- 
menting information intensive and knowledge 
discovery applications. To support this class of 
applications, tools and services for data mining 
and knowledge discovery on grids are essential. 
This objective can be achieved through the de- 
velopment of techniques and tools for supporting 
data intensive applications, and the integration of 
data and computational grids with information and 
Knowledge Grids (see Figure 1). The process of 
unification of data management and knowledge 
discovery systems with grid technologies for pro- 
viding knowledge-based grid services can bring 
many benefits to science and industry. 


Figure 1. Combination of KDD and grid technologies for building a Knowledge Grid

Massive amounts of data are today produced
and stored in digital archives. We are able to store 
petabytes of data in databases and query them 
at an acceptable rate. However, the extraction of 
hidden information and knowledge from huge 
amounts of stored data can make data ownership 
more competitive. 

Grids represent a good opportunity to handle 
very large datasets distributed over a large num- 
ber of sites. At the same time, grids can be used 
as knowledge discovery engines and knowledge 
management platforms. To effectively use grids 
for such high-level knowledge-based applications, 
models, algorithms, and software environments 
are needed. 

The Knowledge Grid is a high-level system 
developed for providing grid-based knowledge 
discovery services (Cannataro & Talia, 2003). 
Services for knowledge discovery on grids have 
been implemented on top of generic grid services 
provided by the Globus Toolkit. These services 
allow researchers, professionals, and scientists to 
create and manage complex knowledge-discovery 
applications composed as workflows that integrate 
data sources, mining tools, computing, and storage 
provided as distributed services. Knowledge Grid 
facilities allow users to compose, store, share, and 
execute these knowledge discovery workflows, 
as well as publish them as new components and 
services on the grid. 

The knowledge building process in a dis- 
tributed setting involves collection/generation 
and distribution of data and information, fol- 
lowed by collective interpretation of processed 
information into “knowledge.” The Knowledge 
Grid provides high-level abstractions and a set 
of services based on the use of grid resources to 
support all the phases of the knowledge discovery 
process. Therefore, it allows end users to focus 
on the knowledge discovery process they must 
develop, without worrying about grid infrastruc- 
ture and fabric details. This chapter describes 
knowledge discovery services and features of the 
Knowledge Grid environment. First, we discuss
knowledge discovery services, present the system
architecture, and describe how its components
can be used to design and implement knowledge
discovery applications for science, industry, and
commerce. Then we describe how the Knowledge
Grid services can be made accessible under the
open grid services architecture (OGSA). Finally,
we outline some knowledge discovery applications
that we implemented on the Knowledge Grid.

KNOWLEDGE DISCOVERY 
SERVICES 

Today many organizations, industries, and scien- 
tific centers produce and manage large amounts 
of complex data and information. Climate data, 
astronomic data, and company transaction data are 
just some examples of massive amounts of digital 
data repositories that must be stored and analyzed 
to find useful knowledge in them. This data and
information patrimony can be effectively exploited
if used as a source to produce the knowledge
necessary to support decision making.

Knowledge discovery procedures in all these 
application areas typically require the creation 
and management of complex, dynamic, multi- 
step workflows. At each step, data from various 
sources can be moved, filtered, integrated, and fed
into a data-mining tool. By examining the output 
results, the analyst chooses which other data sets 
and mining components can be integrated in the 
workflow or how to iterate the process to get a 
knowledge model. Workflows are mapped on a 
grid by assigning abstract computing nodes to 
grid hosts and exploiting communication facili- 
ties to ensure information/data exchange among 
the workflow stages. 

The knowledge extraction process is both
computationally intensive, and collaborative and
distributed in nature. Unfortunately, very few
high-level tools support knowledge discovery
and management in distributed environments.
This is particularly true in
grid-based knowledge discovery (Berman, 2001),
although some research and development projects 
and activities in this area have been activated 
in recent years, such as the Knowledge Grid, 
Discovery Net, the DataSpace project, and the 
Datacentric Grid. In particular, the Knowledge 
Grid (Cannataro & Talia, 2002) we discuss here 
provides a middleware for knowledge discovery 
services targeted to a wide range of high-perfor- 
mance distributed applications. 

Discovery Net is a project conducted at the 
Engineering and Physical Sciences Research 
Council at Imperial College (Curcin, Ghanem, 
Guo, Kohler, Rowe, Syed, & Wendel, 2002), 
whose main goal is the design, development, and 
implementation of an infrastructure for effectively 
supporting scientific knowledge discovery pro- 
cesses. Within this project, a series of testbeds 
and demonstrations have been carried out in the 
areas of life sciences, environmental modeling, 
and geo-hazard prediction. 

The building blocks in Discovery Net are the
so-called knowledge discovery services (KDS),
divided into computation services and data
services. The former typically comprise algorithms,
for example, for data preparation and data
mining, while the latter define relational tables
(as queries) and other data sources. KDS are used
to compose moderately complex data-pipelined
processes. The composition may be carried out by
means of a GUI that provides access to a library
of services. An XML-based language called
discovery process markup language (DPML) is
used to describe processes.

Among other projects for distributed data anal- 
ysis, DataSpace proposes a significant system to 
address efficient data access and transfer over the 
grid (Grossman & Mazzucco, 2002). DataSpace 
is a Web-services-based infrastructure for explor- 
ing, analyzing, and mining remote and distributed 
data. DataSpace applications employ a protocol
for working with remote and distributed data,
called DataSpace transfer protocol (DSTP).
DSTP simplifies working with data by providing
direct support for common operations, such
as working with attributes, keys, and metadata.
DSTP can be layered over specialized
high-performance transport protocols such
as SABUL (Gu & Grossman, 2003), which allows
DataSpace applications to work effectively on
wide-area high-performance networks.

Unlike the two environments discussed
above, the Datacentric Grid is a system
targeted at knowledge discovery on grids and
designed mainly to deal with immovable data
(Skillicorn, 2002). The nodes at which computations
happen are called data/compute servers (DCS).
Besides a compute engine and a data repository, 
each DCS comprises a metadata tree, which is 
a structure for maintaining relationships among 
raw data sets and models extracted from them. 
Furthermore, extracted models become new data 
sets, potentially useful at subsequent steps and/or 
for other applications. 

The grid support nodes (GSNs) maintain information
about the whole grid. Each GSN contains a
directory of DCSs with static and dynamic information
about them (e.g., properties and usage), and
an execution plan cache containing recent plans
along with their achieved performance. Since a
computation in the Datacentric Grid is always
executed on a single node, execution plans are
simple. However, they can start at different places
in the model hierarchy because, when they reach
a node, they may or may not find already computed
models there. The user support nodes (USNs) carry out
execution planning and maintain results.

The Knowledge Grid supports knowledge 
discovery activities by providing mechanisms and 
higher-level services for searching resources and 
representing, creating, and managing knowledge 
discovery processes, and for composing exist- 
ing data services and data-mining services in a 
structured manner, allowing designers to plan,
store, document, verify, share, and reexecute
their workflows, as well as manage their output
results. We designed a general framework that,
using the generic grid services and middleware, 
defines a set of tools and services needed to sup- 
port all the main steps of a KDD process in a 
distributed environment, from the data selection 
and data-mining steps to the knowledge storing 
and interpretation. 

In the Knowledge Grid environment, discovery 
processes are represented as workflows that a user 
may compose using both concrete and abstract 
grid resources. Knowledge discovery workflows 
are defined using a visual interface that shows 
resources (data, tools, and hosts) to the user, 
and offers mechanisms for integrating them in a 
workflow. A formal representation of resources 
and workflows is stored using an XML-based 
notation in which a workflow is expressed as a 
data-flow graph of nodes, each representing either 
a data-mining service or a data-transfer service. 
The XML representation allows the workflows 
for discovery processes to be easily validated, 
shared, translated in executable scripts, and stored 
for future executions. 
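The idea of storing a workflow as an XML data-flow graph can be illustrated with a minimal sketch. The element and attribute names used here (`workflow`, `node`, `flow`, `kind`, `ref`) are assumptions made for illustration only; the chapter does not give the actual Knowledge Grid schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical node kinds; the text names data-mining and data-transfer services.
NODE_KINDS = {"data-mining", "data-transfer"}

def workflow_to_xml(name, nodes, edges):
    """Serialize a data-flow graph to an XML string.

    nodes: dict mapping node id -> (kind, resource reference)
    edges: list of (source id, target id) pairs
    """
    root = ET.Element("workflow", {"name": name})
    for node_id, (kind, ref) in nodes.items():
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        ET.SubElement(root, "node", {"id": node_id, "kind": kind, "ref": ref})
    for src, dst in edges:
        # Validation step: every flow must connect two declared nodes.
        if src not in nodes or dst not in nodes:
            raise ValueError(f"flow refers to unknown node: {src} -> {dst}")
        ET.SubElement(root, "flow", {"from": src, "to": dst})
    return ET.tostring(root, encoding="unicode")

nodes = {
    "t1": ("data-transfer", "dataset-DS"),
    "m1": ("data-mining", "classifier-tool"),
}
xml_text = workflow_to_xml("demo", nodes, [("t1", "m1")])
parsed = ET.fromstring(xml_text)  # the stored form can be parsed back and shared
```

Because the stored form is plain XML, the same document can be validated, exchanged between users, or translated into a script by any XML-aware tool.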


KNOWLEDGE GRID COMPONENTS 
AND TOOLS 

Figure 2 shows the general structure of the 
Knowledge Grid system and its main components 
and communication interfaces. The high-level 
k-grid layer includes services used to compose, 
validate, and execute a parallel and distributed 
knowledge-discovery computation. That layer 
also offers services for storing and analyzing the 
discovered knowledge. The main services of the 
high-level k-grid layer are: 

• The data access service (DAS) provides 
search, selection, transfer, transformation, 
and delivery of data to be mined. 

• The tools and algorithms access service 
(TAAS) is responsible for searching, select- 
ing, and downloading data-mining tools and 
algorithms. 

• The execution plan management service
(EPMS). An execution plan is represented
by a graph describing interactions and data
flows between data sources, extraction tools,
data mining tools, and visualization tools.
The execution plan management service
allows for defining the structure of an application
by building the corresponding
graph and adding a set of constraints about
resources. Generated execution plans are
stored, through the RAEMS, in the knowledge
execution plan repository (KEPR).

Figure 2. The Knowledge Grid general structure and components

• The results presentation service (RPS) offers
facilities for presenting and visualizing the
extracted knowledge models (e.g., association
rules, clustering models, classifications).
The resulting metadata is stored in the KMR
to be managed by the KDS (see Figure 2).

The core k-grid layer includes two main
services:

• The knowledge directory service (KDS)
manages metadata describing Knowledge
Grid resources. Such resources comprise
hosts; repositories of data to be mined; tools
and algorithms used to extract, analyze, and
manipulate data; distributed knowledge
discovery execution plans; and knowledge
obtained as a result of the mining process.
The metadata information is represented 
by XML documents stored in a knowledge 
metadata repository (KMR). 

• The resource allocation and execution man- 
agement service (RAEMS) is used to find a 
suitable mapping between an “abstract” ex- 
ecution plan (formalized in XML) and avail- 
able resources, with the goal of satisfying 
the constraints (computing power, storage, 
memory, database, network performance) 
imposed by the execution plan. After the 
execution plan activation, this service man- 
ages and coordinates the application execu- 
tion and the storing of knowledge results in 
the knowledge base repository (KBR). 
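The mapping performed by the RAEMS, from an abstract execution plan to hosts that satisfy its constraints, can be sketched as follows. This is only an illustrative greedy strategy under assumed property names (`cpu_ghz`, `memory_gb`, `disk_gb`); the chapter does not describe the actual allocation algorithm.

```python
def satisfies(host_properties, constraints):
    """True if the host offers at least the required amount of every resource."""
    return all(host_properties.get(key, 0) >= needed
               for key, needed in constraints.items())

def map_plan(abstract_tasks, hosts):
    """Assign each abstract task the first still-free host meeting its constraints.

    abstract_tasks: dict task name -> dict of minimum requirements
    hosts: dict host name -> dict of offered properties
    """
    mapping = {}
    free = dict(hosts)
    for task, constraints in abstract_tasks.items():
        chosen = next((name for name, props in free.items()
                       if satisfies(props, constraints)), None)
        if chosen is None:
            raise LookupError(f"no free host satisfies the constraints of {task}")
        mapping[task] = chosen
        del free[chosen]  # one task per host in this simplified sketch
    return mapping

hosts = {
    "hostA": {"cpu_ghz": 2.0, "memory_gb": 4, "disk_gb": 100},
    "hostB": {"cpu_ghz": 3.0, "memory_gb": 16, "disk_gb": 500},
}
tasks = {
    "partition": {"disk_gb": 200},
    "classify": {"memory_gb": 2},
}
mapping = map_plan(tasks, hosts)
```

A greedy pass like this can fail even when a valid assignment exists, so a real allocator would need backtracking or an optimization step.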


All or part of the services provided by the 
Knowledge Grid can be resident on each grid node 
where Globus is available. If a grid node offers data
sources but does not provide data-mining tools,
it does not need to configure the tools and algorithms
access service. The same principle holds for the
other services. For example, the execution plan 
management service can be configured only on 
the grid nodes on which an execution plan is pro- 
duced, whereas the RAEMS must be configured 
on each grid node that contributes to the execution 
of a KDD application. In summary, the services 
designed in the Knowledge Grid architecture are 
needed to perform all the steps of a KDD process 
on a grid, but they can be adaptively configured 
as they are necessary to run a distributed KDD 
process. 

The main components of the Knowledge Grid 
environment have been implemented and are 
available through a software prototype, named
VEGA (visual environment for grid applications),
which embodies services and functionalities rang- 
ing from information and discovery services to 
visual design and execution facilities (Cannataro, 
Congiusta, Talia & Trunfio, 2002). VEGA offers 
a simple way to design and execute complex grid 
applications by exploiting advantages coming 
from the grid environment. In particular, it offers 
a set of visual facilities and services that give the 
users the possibility to design applications start- 
ing from a view of the present grid status (i.e., 
available nodes and resources), and composing 
the different steps inside a structured and com- 
prehensive environment (see Figure 3). Moreover, 
VEGA overcomes the typical difficulties of grid
application programmers by offering a high-level
graphical interface and by interacting with the
knowledge directory service (KDS) to discover the
available nodes in a grid and retrieve additional
information (metadata) about their published
resources. By using the abstractions offered by
VEGA, the user is freed from the task of coupling
the application structure with the underlying grid
infrastructure.

Figure 3. The VEGA visual interface



VEGA integrates functionalities of the EPMS 
and other Knowledge Grid services; in particular, 
it provides for the following EPMS operations: 

• Task composition 

• Consistency checking 

• Execution plan generation 

Task composition consists of defining
the entities involved in the computation and
specifying the relationships among them.
The task composition facility allows the user to
build typical grid applications in an easy, guided,
and controlled way, always keeping a global view
of the grid status and of the application being
built. Key concepts in the VEGA approach
to the design of a grid application are the visual
language used to describe, in a component-like
manner and through a graphical representation,
the jobs constituting an application, and the
possibility of grouping these jobs in workspaces so as to
define specific interdependent stages. Structured
applications, composed of multiple sequential
stages, exploit both the workspace concept and
the virtual resource abstraction. Thanks to these
entities, it is possible to compose applications
working on the outcomes of previous phases as
if they were available, even if the execution has
not been performed yet. For more general applications,
which comprise an unspecified number of nodes
or can be run on different grid deployments,
VEGA supports the so-called abstract resources.
They allow resources to be specified by means
of constraints (e.g., required main memory, disk
space, CPU speed, operating system). When
abstract resources are employed to define an application,
an appropriate matching of abstract
resources with physical ones, and possibly an
optimization phase, are performed prior to submitting
all the jobs for execution. A computation in the
VEGA visual language is organized as a set of
workspaces, each hosting resources and specifying
relationships among them so as to define one or
more jobs. Jobs belonging to the same workspace




Using Grids for Distributed Knowledge Discovery 


are executed concurrently, whereas an arbitrary 
ordering can be specified among different work- 
spaces, by composing them to form a directed 
acyclic graph (DAG). 
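The execution semantics just described (jobs in one workspace run concurrently, workspaces follow a DAG ordering) can be sketched with the Python standard library. The function and workspace names are hypothetical and this is not VEGA code; for simplicity, workspaces are run one after another in a topological order, although independent workspaces could also overlap.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+

def run_workspaces(workspaces, depends_on):
    """Run workspaces in an order compatible with the DAG.

    workspaces: dict name -> list of zero-argument callables (the jobs)
    depends_on: dict name -> set of workspace names that must finish first
    """
    graph = {name: depends_on.get(name, set()) for name in workspaces}
    order, results = [], {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        with ThreadPoolExecutor() as pool:
            # Jobs of the same workspace are submitted together and run concurrently.
            futures = [pool.submit(job) for job in workspaces[name]]
            results[name] = [future.result() for future in futures]
        order.append(name)
    return order, results

workspaces = {
    "partition": [lambda: "split"],
    "mine": [lambda: "tree-1", lambda: "tree-2"],
    "view": [lambda: "report"],
}
depends_on = {"mine": {"partition"}, "view": {"mine"}}
order, results = run_workspaces(workspaces, depends_on)
```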

Consistency checking is performed by a
module that parses the model of the computation
both while the design is in progress and prior to
executing it, monitoring and guiding the user’s actions
so as to obtain a correct and consistent graphical
representation. A preprocessing of the computation
model takes place during the graphical
composition; through context-sensitive control,
it ensures that only well-formed jobs are defined. The
checking is completed by a postprocessing of
the computation model, which catches
those errors that cannot be recognized
during the preprocessing phase.
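The two-phase check can be illustrated with a small sketch. The resource kinds and the link rules in `ALLOWED_LINKS` are invented for this example, and the class is not part of VEGA: the point is only that some errors can be rejected while the user draws a link, whereas others are visible only on the finished model.

```python
# Hypothetical rules: which kind of resource may be linked to which.
ALLOWED_LINKS = {("data", "tool"), ("tool", "host"), ("tool", "data")}

class ComputationModel:
    def __init__(self):
        self.resources = {}  # name -> kind ("data", "tool", or "host")
        self.links = []      # (source name, target name)

    def add_resource(self, name, kind):
        self.resources[name] = kind

    def add_link(self, source, target):
        """Preprocessing: reject an ill-formed link as soon as it is drawn."""
        for name in (source, target):
            if name not in self.resources:
                raise ValueError(f"unknown resource: {name}")
        pair = (self.resources[source], self.resources[target])
        if pair not in ALLOWED_LINKS:
            raise ValueError(f"link not allowed: {pair[0]} -> {pair[1]}")
        self.links.append((source, target))

    def postcheck(self):
        """Postprocessing: report errors that only the whole model reveals."""
        errors = []
        for name, kind in self.resources.items():
            if kind != "tool":
                continue
            has_input = any(dst == name and self.resources[src] == "data"
                            for src, dst in self.links)
            has_host = any(src == name and self.resources[dst] == "host"
                           for src, dst in self.links)
            if not has_input:
                errors.append(f"{name}: no input data")
            if not has_host:
                errors.append(f"{name}: no execution host")
        return errors

model = ComputationModel()
model.add_resource("d", "data")
model.add_resource("t", "tool")
model.add_resource("h", "host")
model.add_link("d", "t")
errors_before = model.postcheck()  # the tool still lacks an execution host
model.add_link("t", "h")
errors_after = model.postcheck()
```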

In the execution plan generation phase, the 
computation model is translated into an execu- 
tion plan represented by an XML document. The 
execution plan describes a data-mining computa- 
tion at a high level, neither containing physical 
information about resources (which are identified 
by metadata references), nor about status and cur- 
rent availability of such resources. In fact, specific 
information about the involved resources will be 
included prior to the allocation process. 

KNOWLEDGE GRID AND OGSA 

Grid technologies are evolving towards an open 
grid architecture, called the open grid services 
architecture (OGSA), in which a grid provides an 
extensible set of services that virtual organizations 
can aggregate in various ways (Talia, 2002). 

OGSA defines a uniform-exposed service 
semantics, the so-called grid service, based on 
concepts and technologies from both the grid 
computing and Web services communities. 
Web services define a technique for describing 
software components to be accessed, methods 
for accessing these components, and discovery 
methods that enable the identification of relevant 


service providers. Web services are, in principle, 
independent from programming languages and 
system software; standards are being defined 
within the World Wide Web Consortium (W3C) 
and other standards bodies. 

Web services and OGSA aim at interoper- 
ability between loosely coupled services that 
are independent of implementation, location, or 
platform. OGSA defines standard mechanisms for 
creating, naming, and discovering persistent and 
transient grid service instances, provides location 
transparency and multiple protocol bindings for 
service instances, and supports integration with 
underlying native platform facilities. The OGSA 
effort aims at defining a common resource model 
that is an abstract representation of both real
resources, such as processors, processes, disks, and file
systems, and logical resources. It provides some
common operations, and supports multiple un- 
derlying resource models representing resources 
as service instances. 

In OGSA all services adhere to specified grid 
service interfaces and behaviors required for 
creating and composing sophisticated distributed 
systems. Service bindings can support reliable
invocation, authentication, authorization, and
delegation. To this end, OGSA defines a grid service
as a Web service that provides a set of well-defined
interfaces, and that follows specific conventions
on their use for grid computing. OGSA also defines,
in terms of the Web services description language 
(WSDL) (Christensen, Curbera, Meredith, & 
Weerawarana, 2001), mechanisms required for 
creating and composing sophisticated distributed 
systems, including lifetime management, change 
management, and notification. A first specifica- 
tion of the concepts and mechanisms defined in 
the OGSA is provided by the open grid services 
infrastructure (OGSI) (Tuecke et al., 2003). 

More recently, the Web services resource
framework (WSRF) was adopted as a refactoring
and evolution of OGSI aimed at exploiting new
Web services standards, and at evolving OGSI on
the basis of early implementation and application
experiences (Czajkowski et al., 2004). WSRF
codifies the relationship between Web services
and stateful resources (called WS-resources) in
terms of a set of conventions on Web services
technologies, in particular XML, WSDL, and
WS-Addressing (Box et al., 2004). The framework
describes how a WS-resource is defined and
associated with the description of a Web service
interface, and how to make the properties
of a WS-resource accessible through a Web
service interface. Although OGSI and WSRF model
stateful resources differently, as a grid service
and a WS-resource, respectively, both provide
essentially equivalent functionalities. Both grid
services and WS-resources, in fact, can be created,
addressed, and destroyed in essentially the
same ways. The Globus Toolkit 4 is the reference
implementation of WSRF.

We are devising an implementation of the 
Knowledge Grid in terms of the OGSA model. 
In this implementation, each of the Knowledge 
Grid services is exposed as a persistent service, 
using the OGSA conventions and mechanisms. 
For instance, the EPMS service will implement
several interfaces, including the notification
interface, which allows the asynchronous delivery to
the EPMS of notification messages coming from
services invoked as stated in execution plans. At
the same time, basic knowledge discovery ser- 
vices can be designed and deployed by using the 
KDS services for discovering grid resources that 
could be used in composing knowledge discovery 
applications. 

In the next section, we discuss an example of 
distributed data mining application to show how
its execution can benefit from the Knowledge Grid
services provided through the OGSA model. 

AN INTRUSION-DETECTION 
APPLICATION 

The goal of the example application is to obtain
a classifier for an intrusion-detection system. In
particular, the mining process is performed on a
dataset containing records generated by network
monitoring over a given period of time. The main
issues in such applications are the
very large size of the dataset and the need to
extract a number of suitable classification models
to be employed in an intrusion-detection system.
To this end, a number of independent classifiers
are first obtained by applying, in parallel, the same
learning algorithm to a set of distributed training
sets generated through a random partitioning
of the overall dataset. Afterwards, the best
classifier is chosen by means of a voting operation
that takes into account evaluation criteria such as
computation time, error rate, and confusion
matrix.

In the scenario of the example, a user appli- 
cation interacts with Knowledge Grid nodes to 
generate classifiers built from different subsets of 
a given dataset. The C4.5 data-mining algorithm 
is used to generate classifiers as decision trees. 
After the partitioning step, each training set is 
moved to a Knowledge Grid node providing a 
C4.5 data-mining service. The induction of the 
decision trees is performed in parallel on each 
node, followed by the validation of the models 
against a testing set. The results are then moved 
back to the user that may visualize the classifiers 
to evaluate and select the obtained results. 
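The partition, train-in-parallel, and select pattern can be sketched in a few lines. Note the loud assumptions: the learner below is a one-feature threshold rule used as a stand-in (it is NOT C4.5), records are hypothetical `(value, label)` pairs, local threads stand in for grid nodes, and the selection uses only the error rate rather than the full set of criteria mentioned in the text.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def partition(records, n, seed=0):
    """Randomly split records into n disjoint subsets of near-equal size."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]

def train_stump(training_set):
    """Stand-in learner (not C4.5): choose the threshold t that minimizes
    training errors for the rule 'value >= t means label 1'."""
    best_t, best_errors = None, None
    for t in sorted({value for value, _ in training_set}):
        errors = sum((value >= t) != bool(label) for value, label in training_set)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def error_rate(threshold, testing_set):
    wrong = sum((value >= threshold) != bool(label) for value, label in testing_set)
    return wrong / len(testing_set)

def best_classifier(records, n, testing_set):
    """Train one model per subset in parallel, then keep the lowest-error one."""
    subsets = partition(records, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        models = list(pool.map(train_stump, subsets))
    return min((error_rate(model, testing_set), model) for model in models)

records = [(x, int(x >= 50)) for x in range(100)]
testing = [(x, int(x >= 50)) for x in range(0, 100, 7)]
parts = partition(records, 4)
err, threshold = best_classifier(records, 4, testing)
```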

The application makes use of three types of 
nodes: 

• N_u, the node running the user application
that builds the intrusion-detection application
and visualizes the results.

• N_0, the node on which the original dataset DS
is located, and which provides a data-partitioning
service.

• N_1, ..., N_n, the n nodes providing a C4.5
data-mining service, which perform classification
of n different subsets of DS in parallel. In
particular, the C4.5 mining service provides
both classification and validation functions,
exported to remote clients through the corresponding
grid service operations “classify”
and “validate.”

We assume that the user application knows in
advance the existence of the initial dataset DS and
the partitioning service on node N_0. Moreover, we
assume that N_u and N_0, ..., N_n provide Knowledge
Grid services, and that N_0, ..., N_n provide services
for the reservation of different kinds of resources
(e.g., storage and computing cycles).

Figure 4 shows a possible scenario for the
example application. Apart from N_u and N_0, only
three computing nodes are represented (N_1, N_2, and
N_3). The application can be performed as follows
(see Figure 4):

1. The user application invokes the TAAS service
on N_u to locate n nodes providing the
required C4.5 mining service.

2. The TAAS service of N_u invokes the corresponding
services on other Knowledge
Grid nodes in order to obtain information
about the needed resources; the contacted nodes
reply by sending meta-information (for clarity,
the figure shows only the interaction with the
TAAS of one node, as a dashed line).

3. The meta-information about nodes N_1, ..., N_n is
examined, and such nodes are identified as
candidates for the computation. The TAAS
service on N_u sends this information to the
user application.

4. The user application builds an execution plan
for the data process, specifying strategies
for data movement and algorithm execution.
The execution plan is submitted to the EPMS
of N_u.

5. The EPMS invokes the reservation services
on N_0, ..., N_n to reserve computing cycles and
storage space. In particular, on N_0, computing
cycles are reserved for executing the
partitioning of DS, and storage space is
used to maintain the extracted subsets. On
N_1, ..., N_n, cycles are reserved to execute the
classification and validation tasks, whereas
storage space is used to maintain both the
input subsets and the inferred classifiers.

6. The EPMS invokes the partitioner service
on N_0 to extract n training sets and n testing
sets from DS.

7. The EPMS invokes the DAS service on N_0
to transfer, to each node N_1, ..., N_n, a
training/testing set couple.

8. The EPMS invokes in sequence: i) the “classify”
operation of the C4.5 mining service
on N_1, ..., N_n to generate the classifiers; ii) the
“validate” operation of the C4.5 mining
service on N_1, ..., N_n to validate the classifiers.
As soon as each operation is executed, a
notification message is sent to the EPMS
(not shown in the figure).

9. The EPMS invokes the DAS service on
N_1, ..., N_n to transfer the classifiers to N_u.

10. The EPMS invokes the RPS service on N_u to
visualize the classifiers and support the user
in evaluating and selecting the results.

Figure 4. A distributed data mining example
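Steps 6 to 9 of the sequence above, including the notification sent to the EPMS after each operation, can be mimicked with a toy simulation. Everything here is hypothetical: the `Epms` class, the placeholder "model" that merely memorizes its training records, and the use of local threads instead of remote grid services. It is a sketch of the control flow only.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Epms:
    """Toy stand-in for the execution plan management service."""
    def __init__(self):
        self._lock = threading.Lock()
        self.notifications = []

    def notify(self, node, operation):
        with self._lock:
            self.notifications.append((node, operation))

def mine_on_node(epms, node, training_set, testing_set):
    # "classify": placeholder model that memorizes the training labels
    model = dict(training_set)
    epms.notify(node, "classify")
    # "validate": fraction of testing records the placeholder model gets right
    hits = sum(model.get(key) == label for key, label in testing_set)
    epms.notify(node, "validate")
    return node, hits / len(testing_set)

def execute_plan(dataset, n):
    epms = Epms()
    # Step 6: the partitioner on N_0 extracts n training/testing couples.
    # The testing set is drawn from the training subset purely for simplicity.
    couples = []
    for i in range(n):
        subset = dataset[i::n]
        couples.append((subset, subset[:2]))
    # Steps 7-8: each couple goes to one mining node, which runs "classify"
    # and then "validate", notifying the EPMS after each operation.
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [
            pool.submit(mine_on_node, epms, f"N_{i + 1}", train, test)
            for i, (train, test) in enumerate(couples)
        ]
        # Step 9: the results are gathered for the user node N_u.
        results = [future.result() for future in futures]
    return epms, results

dataset = [(k, k % 2) for k in range(12)]
epms, results = execute_plan(dataset, 3)
```

Notifications from different nodes may interleave in any order, but on each node "classify" is always reported before "validate".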

KDD APPLICATIONS ON THE 
KNOWLEDGE GRID 

In this section, we describe some significant appli- 
cations implemented on the Knowledge Grid. 

The first one is an implementation of the 
intrusion detection application described in the 
previous section (Cannataro, Congiusta, Pug- 
liese, Talia, & Trunfio, 2004). This application 
is characterized by the use of a massive
dataset (containing millions of records) and by
the need to extract a classification model to
be employed in almost real time in the network
security system. The Knowledge Grid, thanks
to its high-level KDD-oriented features and its
performance, has been a profitable and valuable
choice for the development and
execution of the application. The application has
been tested on Knowledge Grid deployments
including 3 and 8 nodes, and the execution times
have been compared with those of the sequential
execution. The measured speed-up for these
configurations was about 2 and 5, respectively.
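To put these figures in perspective, speed-up is the ratio of sequential to parallel execution time, and dividing it by the number of nodes gives the parallel efficiency. Efficiency is not reported in the chapter; the small calculation below derives it from the approximate speed-ups quoted above, so the results are approximate as well.

```python
def speedup(sequential_time, parallel_time):
    """Ratio of sequential to parallel execution time."""
    return sequential_time / parallel_time

def efficiency(measured_speedup, nodes):
    """Fraction of the ideal (linear) speed-up actually achieved."""
    return measured_speedup / nodes

# Approximate speed-ups quoted in the text for 3 and 8 nodes.
reported = {3: 2.0, 8: 5.0}
efficiencies = {nodes: round(efficiency(s, nodes), 3) for nodes, s in reported.items()}
```

Under these approximate values, the efficiency is roughly two thirds on 3 nodes and somewhat lower on 8 nodes, which is the usual trend as fixed overheads are spread over more nodes.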

Another application developed on the Knowledge
Grid focuses on bioinformatics,
in particular on a “proteomics” application (Cannataro,
Comito, Congiusta, & Veltri, 2004). Protein
function prediction uses database searches to find
proteins similar to a new protein, thus inferring
the protein function. This method is generalized
by protein clustering, where databases of proteins
are organized into homogeneous families
to capture protein similarity. The implemented
application carries out the clustering of human
protein sequences using the TribeMCL method.
TribeMCL is a clustering method through which
it is possible to cluster correlated proteins into
groups termed “protein families.” This clustering
is achieved by analyzing similarity patterns between
proteins in a given dataset, and using these
patterns to assign proteins to related groups. In
many cases, proteins of the same protein family
will have similar functional properties. TribeMCL
uses the Markov clustering algorithm. The application
comprises four phases: i) data selection, ii)
data preprocessing, iii) clustering, and iv) results
visualization. The data selection phase extracts
sequences from the database. The data preprocessing
phase prepares the selected data for the
clustering operation; in fact, TribeMCL needs a
BLAST comparison on its input data. BLAST is
a similarity search tool based on string-matching
algorithms; given a string, it finds string
sequences or subsequences matching some
of the proteins in a given database (this process is
called alignment). Thus, once the protein sequences
have been extracted from the database, a BLAST
computation has to be performed. The clustering
phase performs the Markov clustering algorithm
to obtain a set of protein clusters, and finally, the
results visualization phase displays the obtained
results. Application execution times have been
measured in two different cases:
a) only 30 human proteins, and b) all the human
proteins in the Swiss-Prot database. Comparing
the execution times, we noted that the execution of
the clustering phase is a computationally intensive
operation and consequently takes much more
time when all the proteins have to be analyzed.
However, the grid execution of these phases, using
three nodes, has been performed in a time
reduced by a factor of about 2.3 with respect to
the sequential execution.
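For readers unfamiliar with Markov clustering, the following is a minimal pure-Python sketch of its two alternating steps, expansion (matrix squaring) and inflation (element-wise powering followed by column normalization). It is a textbook-style illustration, not the TribeMCL implementation; the small similarity matrix stands in for BLAST-derived scores, and the inflation value, iteration count, and cut-off are arbitrary choices.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (a probability distribution)."""
    size = len(m)
    sums = [sum(m[r][c] for r in range(size)) for c in range(size)]
    return [[m[r][c] / sums[c] for c in range(size)] for r in range(size)]

def multiply(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def markov_cluster(similarity, inflation=2.0, iterations=50):
    size = len(similarity)
    # Add self-loops, then turn each column into transition probabilities.
    m = [[similarity[r][c] + (1.0 if r == c else 0.0) for c in range(size)]
         for r in range(size)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = multiply(m, m)                                        # expansion
        m = [[value ** inflation for value in row] for row in m]  # inflation
        m = normalize_columns(m)
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(c for c, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)

# Hypothetical similarity scores: items 0-2 resemble each other, as do items 3-4.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
families = markov_cluster(similarity)
```

The expansion step is a dense matrix product, which is consistent with the observation above that clustering dominates the running time when the full protein set is analyzed.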

The last application we discuss here is concerned
with an effort to integrate a query-based
data mining system into the Knowledge Grid
environment (Bueti, Congiusta & Talia, 2004).
KDDML-MQL is a system for the execution of
complex mining tasks expressed as high-level
queries, through which it is possible to combine
KDD operators (preprocessing, mining, etc.) with
classical database operations such as selection
and join. A KDDML query thus has the
structure of a tree in which each node is a KDDML
operator specifying the execution of a KDD task
or the logical combination (and/or operators) of
results coming from lower levels of the tree. To
integrate such a system
(not developed for the grid) into the Knowledge
Grid, a slight adaptation of its structure was
needed. We modified KDDML into a distributed
application composed of three independent components:
query entering and splitting (performing
the query entering and its subsequent splitting
into subqueries to be executed in parallel), query
executor, and results visualization. The distributed
execution of KDDML has been modeled according
to the master-worker paradigm, a worker being
an instance of the query executor. In addition, a
proper allocation policy for the subqueries has
been implemented. It is based both on optimization
criteria (so as to balance the assignment of subqueries
to grid nodes) and on the structure of the tree, in
order to correctly reconstruct the final response
by combining the partial results. Some preliminary
experimental results, aimed at testing the validity and
feasibility of this approach, have been obtained by
running some queries on a grid testbed, showing
encouraging and satisfactory outcomes.
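The master-worker evaluation of a query tree can be sketched as follows. The tree encoding (dictionaries with `op`, `children`, and `rows`) and the three operators are invented for illustration and are not KDDML syntax; threads stand in for query executor instances on grid nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(node):
    """Worker: evaluate a (sub)query tree sequentially, bottom-up."""
    op = node["op"]
    if op == "leaf":
        return set(node["rows"])
    parts = [evaluate(child) for child in node["children"]]
    if op == "and":
        return set.intersection(*parts)
    if op == "or":
        return set.union(*parts)
    raise ValueError(f"unknown operator: {op}")

def master(query, workers=2):
    """Master: split the root into subqueries, farm them out, combine results."""
    children = query.get("children", [])
    if not children:
        return evaluate(query)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns partial results in submission order, so the tree
        # structure needed to combine them is preserved.
        partial = list(pool.map(evaluate, children))
    combined = {"op": query["op"],
                "children": [{"op": "leaf", "rows": rows} for rows in partial]}
    return evaluate(combined)

query = {
    "op": "and",
    "children": [
        {"op": "or", "children": [
            {"op": "leaf", "rows": [1, 2]},
            {"op": "leaf", "rows": [3]},
        ]},
        {"op": "leaf", "rows": [2, 3, 4]},
    ],
}
answer = master(query)
```

Splitting only at the root keeps the sketch short; a balanced allocation policy such as the one described above would also consider the size of each subtree.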

CONCLUSION 

The grid can be effectively exploited for de- 
ploying data-driven and knowledge-discovery 
applications (Berman, Fox, & Hey, 2003). It is a 
well-suited infrastructure for managing very large 
data sources and providing high-level mechanisms 
for extracting valuable knowledge from them. To 
perform this class of tasks, advanced tools and 
services for knowledge discovery are vital. 

Here we presented the Knowledge Grid: a 
grid-based software environment that implements 
grid-enabled knowledge-discovery services built
on the Globus Toolkit mechanisms. The Knowl- 
edge Grid can be used as a high-level system 
for providing knowledge discovery services on 
dispersed resources connected through a grid. 
These services allow professionals and scientists 
to create and manage complex knowledge-dis- 
covery applications composed as workflows that 
integrate data sets and mining tools provided as 
distributed services on a grid. They also allow 
users to store, share, and execute these knowledge- 
discovery workflows, as well as publish them as 
new components and services. The Knowledge 
Grid provides a higher level of abstraction of the
grid resources for knowledge discovery activities,
thus allowing end users to concentrate on the
knowledge-discovery process without worrying 
about grid infrastructure details. 

In the coming years, the grid will be used as a
platform for implementing and deploying
geographically distributed knowledge discovery
(Kargupta, Joshi, Sivakumar, & Yesha, 2004)
and knowledge management platforms and
applications. Some efforts in this direction
have recently been started. Systems
such as Discovery Net, the DataSpace project,
the Datacentric Grid, and the Knowledge Grid
discussed in this chapter show the feasibility of
the approach, and can represent the first
generation of knowledge-based pervasive grids.

The future use of grids is mainly related to 
the ability to handle large computations and manage 
worldwide complex distributed applications. 
Among those, knowledge-based applications are 
a major goal. To reach this objective, the grid 
needs to evolve towards an open decentralized 
infrastructure based on interoperable high-level 
services that make use of knowledge both in pro- 
viding resources and in giving results to end users. 
Software technologies for the implementation and 
deployment of Knowledge Grids, as we discussed 
in this chapter, will provide important elements 
to build up knowledge-based applications on a 
small-sized grid or on a worldwide grid. These 
models, techniques, and tools provide the basic 
components for developing grid-based complex 
systems. Examples of such systems include dis- 
tributed knowledge management environments 
providing pervasive access, adaptivity, and high 
performance for virtual organizations in science, 
engineering, and industry needing to produce 
knowledge-based applications. 
INTRODUCTION 

Recently, our capabilities of both generating and 
collecting data have increased rapidly. Conse- 
quently, data mining has become a research area 
with increasing importance. Data mining, also 
referred to as knowledge discovery in databases 
(Chen et al., 1996), is the search for relationships 
and global patterns that exist “hidden” among 
vast amounts of data. There are various problems 
that someone has to deal with when extracting 
knowledge from data, including characterization, 
comparison, association, classification, prediction, 
and clustering (Han & Kamber, 2001). This 
chapter elaborates on the problem of classifica- 
tion. Broadly speaking, pattern classification (or 
recognition) is the science that is concerned with 
the description or classification of measurements. 
More technically, pattern classification is the 
process that finds the common properties among 
a set of objects in a database and classifies them 
into different classes according to a classifica- 
tion model. 

Classical models usually try to avoid vague, 
imprecise, or uncertain information, because it 
is considered as having a negative influence in 


Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited. 


Fuzzy Miner: Extracting Fuzzy Rules from Numerical Patterns 


the inference process. This chapter accepts the 
challenge of dealing with this kind of information 
by introducing a fuzzy system, which deliberately 
makes use of it. The main idea of fuzzy systems 
is to extend the classical two-valued modeling of 
concepts and attributes like tall, fast, or old in a 
sense of gradual truth. This means that a person 
is not just viewed as tall or not tall, but as tall to 
a certain degree between 0 and 1. This usually 
leads to simpler models, which are handled more 
easily and are more familiar to the human way 
of thinking. 

After providing a brief comparative overview 
of pattern classification approaches (Section 2) and 
a short specification of the pattern classification 
domain in fuzzy systems (Section 3), the chapter 
follows the above paradigm and describes an 
effective fuzzy system for the classification of 
numerical data (Section 4). The initial idea comes 
from the fact that fuzzy systems are universal 
approximators (Kosko, 1992; Wang, 1992) of any 
real continuous function. Such an approximation 
method (Nozzaki et al., 1997) coming from the 
domain of fuzzy control systems is appropriately 
adjusted, extended, and implemented in order to 
produce a powerful working solution in the domain 
of pattern classification. An “adaptive” process 
is also introduced, developed, and incorporated 
into the previous mechanism for automatically 
deriving highly accurate linguistic if-then rules. 
The description of the methodology is combined 
with the illustration of the design issues of the 
tool Fuzzy Miner. The current work is evaluated 
(Section 5) by extensive simulation tests and by 
providing a comparison framework with another 
tool of the domain that employs a neuro-fuzzy 
approach, NEFCLASS (Nauck & Kruse, 1995). 
Finally, the chapter concludes (Section 6) by 
identifying promising directions for future work 
pointed to by this research effort. 


COMPARATIVE OVERVIEW OF 
PATTERN CLASSIFICATION 
APPROACHES 

Even when the field was still in its infancy, 
it was realized that statistics and probability 
theory (Berger, 1985) had much to offer to pattern 
classification (Schalkoff, 1992). The question of 
whether or not a given pattern “belongs” to some 
pattern class may naturally be treated as a special 
case of the statistical decision theory problem. 
Effective though it is, the statistical approach 
has built-in limitations. For instance, the theory of 
testing statistical hypotheses entails that a clear-cut 
yes or no answer should always decide the 
membership of a pattern in a given class. Clearly, 
not all real-life patterns admit such coarse 
decisions. Sometimes the information in a pattern lies 
not simply in the presence or absence of a set 
of features; rather, the interconnection of features 
contains important structural information. 
Indeed, this relational information is difficult or 
impossible to quantify in feature-vector 
form. This is the underlying basis of structural 
pattern classification. Structurally based systems 
assume that pattern structure is quantifiable. 
As such, complex patterns can be decomposed 
recursively into simpler subpatterns in much the 
same way that a sentence can be decomposed into 
words. This analogy directed researchers toward 
the theory of formal languages. The process that 
results in an answer to a classification question is 
called syntax analysis or parsing. 

Fuzzy Logic and Fuzzy Systems 
for Pattern Classification 

Fuzzy logic (Zimmermann, 1996) is a superset 
of conventional (Boolean) logic that has been 
extended to handle the concept of partial truth 
(values between “completely true” and “completely 
false”). Fuzzy pattern classification is 
one way to describe systems and their behaviour. 
Conventional computation requires exact data, 
whereas fuzzy pattern classification does not: 
a system can be 
described by using adjectives like “high,” “mid,” 
and “low.” 

Most applications of fuzzy systems can be 
found in the area of control engineering (fuzzy 
control). Fuzzy control applications are based on 
if-then rules. The antecedent of a rule consists 
of fuzzy descriptions of measured input values, 
and the consequent defines a possibly fuzzy out- 
put value for the given input. Basically, a fuzzy 
rule-based system provides an effective way to 
capture the approximate and inexact nature of 
the real world. In particular, fuzzy rule-based 
systems appear useful when the processes are too 
complex for analysis by conventional quantitative 
techniques or when the available information 
from the processes is qualitative, inexact, or 
uncertain. The theoretical base of fuzzy rule-based 
systems is the theory of fuzzy logic. As 
Zimmermann states, 

“Fuzzy set theory provides a strict mathematical 
framework in which vague conceptual phenomena 
can be precisely and rigorously studied” 
(Zimmermann, 1996, p. X). 

Fuzzy Sets: A classical (crisp) set is normally 
defined as a collection of elements or objects $x \in X$, 
which can be finite or countable. Each element 
can either belong to or not belong to a set $A$, $A \subseteq X$. 
Such a classical set can be described in different 
ways; either one can enumerate the elements that 
belong to the set, or one can define the member 
elements by using the characteristic function 
$\mu_A$, in which $\mu_A(x) = 1$ indicates membership 
of $x$ in $A$ and $\mu_A(x) = 0$ non-membership. Pattern 
classification using fuzzy logic (Cios et al., 
1998; Manoranjan et al., 1995) partitions the input 
space into categories (pattern classes) $w_1, \ldots, w_n$ 
and assigns a given pattern $v = (v_1, v_2, \ldots, v_n)$ to 
one of those categories. If $v$ does not fit directly 
within a category, a “goodness of fit” is reported. 
By employing fuzzy sets as pattern classes, it is 
possible to describe the degree to which a pattern 
belongs to one class or another. 

Definition. If $X$ is a collection of objects denoted 
generically by $x$, then a fuzzy set $A$ in $X$ is a set 
of ordered pairs: 

$$A = \{(x, \mu_A(x)) \mid x \in X\} \qquad (3.1)$$ 

$\mu_A$ is the membership function that maps $X$ to 
the membership space $M$, and $\mu_A(x)$ is the grade 
of membership (also degree of compatibility or 
degree of truth) of $x$ in $A$. A widely used function 
is the so-called triangular membership function 


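$$\mu_{m,d}(x) = \begin{cases} 1 - \dfrac{|m - x|}{d} & \text{if } m - d \le x \le m + d \\ 0 & \text{if } x < m - d \text{ or } x > m + d \end{cases} \qquad (3.2)$$ 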


with $d > 0$ and $m \in \mathbb{R}$. This function assumes the 
maximum membership degree of 1 at the value 
$m$. It decreases linearly to the left and right of 
$m$ down to membership degree 0. Such fuzzy sets 
are suitable for modelling linguistic terms like 
“approximately zero.” Of course, the triangular 
function may be replaced by other functions (e.g., 
a trapezoidal or a Gaussian function). 
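As a minimal sketch of the triangular membership function in (3.2), the following Python function evaluates the membership degree of a value for a given centre and spread. The function name and the example spread of 2 for “approximately zero” are illustrative assumptions, not part of the original method description.

```python
def triangular(x: float, m: float, d: float) -> float:
    """Triangular membership of eq. (3.2): degree 1 at m, degree 0 outside [m - d, m + d]."""
    if d <= 0:
        raise ValueError("d must be positive")
    if m - d <= x <= m + d:
        return 1.0 - abs(m - x) / d
    return 0.0


# "Approximately zero" modelled with centre m = 0 and an assumed spread d = 2.
print(triangular(0.0, m=0.0, d=2.0))   # 1.0 at the centre
print(triangular(-1.0, m=0.0, d=2.0))  # 0.5 halfway down the left slope
print(triangular(3.0, m=0.0, d=2.0))   # 0.0 outside the support
```

A trapezoidal or Gaussian shape could be substituted by replacing the body of the function while keeping the same call signature.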

Fuzzy Rules: Fuzzy rules are a collection of 
linguistic statements that describe how a fuzzy 
inference system should make a decision regard- 
ing classifying an input. They combine two or 
more input fuzzy sets and associate with them 
an output. Fuzzy rules are always written in the 
following form: 
IF $v_1$ is $A_1$ and $v_2$ is $A_2$ and . . . and $v_n$ is $A_n$ 
THEN $(v_1, v_2, \ldots, v_n)$ belongs to class $w$, 

where $A_1, A_2, \ldots, A_n$ are input fuzzy sets and $w$ is 
the output fuzzy set. 

For example, one could make up a rule that 
says: 

IF temperature is high and humidity is high 
THEN room is hot. 

There would have to be membership functions 
that define what we mean by high temperature 
(input 1), high humidity (input 2), and a hot room 
(output). This process of taking an input such as 
temperature and processing it through a mem- 
bership function to determine what we mean by 
“high” temperature is called fuzzification. The 
purpose of fuzzification is to map the inputs to 
values from 0 to 1 using a set of input member- 
ship functions. 
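The temperature and humidity rule can be sketched in a few lines of Python. The ramp-shaped membership functions, their breakpoints (20 to 35 degrees Celsius, 40 to 80 percent humidity), and the use of the product to combine the two conditions are all assumptions made for illustration; the minimum is another common choice for “and”.

```python
def ramp_up(x: float, lo: float, hi: float) -> float:
    """Membership rising linearly from 0 at lo to 1 at hi (hypothetical shape for 'high')."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def hot_room_degree(temp_c: float, humidity_pct: float) -> float:
    """Degree to which 'temperature is high and humidity is high' holds."""
    high_temperature = ramp_up(temp_c, 20.0, 35.0)     # fuzzification of input 1
    high_humidity = ramp_up(humidity_pct, 40.0, 80.0)  # fuzzification of input 2
    return high_temperature * high_humidity            # 'and' taken as the product (assumption)


print(hot_room_degree(27.5, 60.0))  # both inputs are 'high' to degree 0.5, giving 0.25
```

Both inputs are mapped to values between 0 and 1 before the rule is evaluated, which is the fuzzification step described above.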

The main advantage of this rule-based formulation 
is its close relation to human thinking. This is 
also the reason that the knowledge of an expert 
can easily be incorporated into a fuzzy pattern 
classification system. In the absence of an expert, 
or in the case of a complex system, there is also the 
possibility of using real data from the 
system to build the fuzzy rules. On the other hand, 
the disadvantages are the necessity to provide the 
fuzzy rules, the fact that a fuzzy system by itself cannot 
learn from data, and the lack of a formal method 
for tuning the membership functions. 

The approaches introduced so far share some 
common features and goals. Thus, boundaries 
between them are not very clear. Each has pitfalls 
and advantages, and the availability and “shape” of 
the features often determine the approach chosen 
to tackle a pattern classification problem. As far 
as fuzzy, statistical, and structural approaches are 
concerned, all are valid approaches to the clas- 
sification problem. The point is that probability 
(statistical approach) involves crisp set theory 


and does not allow for an element to be a partial 
member in a class. Probability is an indicator of 
the frequency or likelihood that an element is 
in a class. On the other hand, formal grammars 
(structural approach) have difficulty learning 
structural rules. Finally, fuzzy set theory deals 
with the similarity of an element to a class. As 
such, if we were to classify someone as “senior,” 
fuzzy membership makes much more sense than 
probability. If we were to classify the outcome of 
a coin flip, probability makes much more sense. 

Neural Networks and Neuro-Fuzzy 
Systems 

The course of argumentation followed so far puts 
the pattern classification theme into a technical- 
mathematical framework. Since pattern classifica- 
tion is an ability of intelligent natural systems, it 
is possible to imitate the neuron (Masters, 1993), 
the basic unit of the brain, by an analogue logical 
processing unit, which processes the inputs and 
produces an output that is either on or off. Thus, 
by extension, a simple neuron can classify the 
input in two different classes by setting the output 
to “1” or “0”. A single neuron is very good at solving 
linearly separable problems, but fails completely 
at solving apparently simple problems such as 
XOR. This issue is easily overcome by multilayer 
networks, which use more than one neuron and 
combine their outputs in further neurons that 
produce a final indication of the class to 
which the input belongs (Bigus, 1996; Craven & 
Shavlik, 1997). 
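The XOR limitation and its multilayer remedy can be illustrated with threshold units whose weights are set by hand. The weights and biases below are one possible choice picked for this sketch, not values taken from the cited works, and no learning takes place.

```python
def neuron(inputs, weights, bias):
    """Threshold unit: outputs 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def xor_net(a: int, b: int) -> int:
    """Two hidden neurons (OR and AND) combined by one output neuron."""
    h_or = neuron((a, b), (1, 1), -0.5)           # fires if a or b is 1
    h_and = neuron((a, b), (1, 1), -1.5)          # fires only if both are 1
    return neuron((h_or, h_and), (1, -1), -0.5)   # OR and not AND


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

Each hidden neuron on its own draws a single straight line through the input plane; only the combination of the two lines in the output neuron separates the XOR classes.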

Among the previously mentioned solutions, 
fuzzy logic and neural networks can be an answer 
to the vast majority of classification problems. Both 
fuzzy systems and neural networks attempt to 
determine the transfer function between a feature 
space and a given class. Both can be automatically 
adapted by the computer in an attempt to optimize 
their classification performance. One difference 
between the two methods is that the membership 
functions of a fuzzy classifier can be initialized in 
a state close to the correct solution, while a neural 
network can only learn from scratch and, as a 
result, can only be initialized in a random state. 
As such, training the classifier is usually much 
faster with a fuzzy classifier than with a neural 
network classifier. The learning capabilities of 
neural networks are nonetheless significant, as 
different learning algorithms are available, and 
they have great potential for parallelism since 
the computations of the components are largely 
independent of each other. Drawbacks still exist, 
however; for example, it is not 
possible to extract rules from neurons for 
interpretation. Consequently, by combining 
fuzzy logic and neural networks (neuro-fuzzy 
systems), we can avoid the drawbacks of each 
method. One neuro-fuzzy classifier, 
NEFCLASS, which is also used in the evaluation 
of our fuzzy system, was introduced in Nauck 
and Kruse (1995). 

Other Approaches 

The necessarily brief overview of the field would 
be incomplete without mentioning the existence 
of some alternative approaches, which are nei- 
ther statistical nor syntactical. For example, the 
geometrical method (Prabhu, 2003) focuses on 
finding data representations or organizations that 
are perceptually meaningful, while the state-space 
method (Oja, 1983) is concerned with finding 
ways of effectively searching the hierarchical 
structures prevalent in many pattern recognition 
tasks. Furthermore, case-based reasoning 
methods (Aamodt & Plazas, 1994; Leake, 1996), 
rough sets techniques (Lenarcik & Piasta, 1997; 
Pawlak, 1991; Swiniarski, 1998) and clustering 
methods (Liu et al., 2000) encompass diverse 
techniques for discovering regularities (or struc- 
tures or patterns) in complex data sets. They may 
serve to suggest either hypothetical models for 
the data-generating mechanism or the existence 
of previously unknown pattern classes. 


Finally, in many applications of fuzzy rule- 
based systems, fuzzy if-then rules have been 
obtained from human experts. Recently, various 
methods were proposed for automatically generat- 
ing fuzzy if-then rules from numerical data. Most 
of these methods have involved iterative learning 
procedures or complicated rule generation mecha- 
nisms such as gradient descent learning methods 
(Nomura et al., 1992), genetic algorithm-based 
methods (Mitchel, 1996), least-squares methods 
(Sugeno & Kang, 1998), a fuzzy c-means method 
(Sugeno & Yasukawa, 1993), and a neuro-fuzzy 
method (Takagi & Hayashi, 1991). In Wang and 
Mendel (1992), an efficient rule generation method 
with no time-consuming iterative procedure is 
proposed. 

FUZZY MINER: A FUZZY SYSTEM FOR SOLVING 
PATTERN CLASSIFICATION 
PROBLEMS 

In this section, we present a powerful fuzzy sys- 
tem for solving pattern classification problems 
and we provide the reader with a description of 
the components of the Fuzzy Miner, their inter- 
nal processes, and their interrelationships. The 
reader interested in a more detailed description 
of the design and implementation issues of Fuzzy 
Miner is referred to Pelekis (1999). That work has 
mainly focused on the study and understanding of 
a method proposed in Nozzaki et al. (1997). Fuzzy 
if-then rules with non-fuzzy singletons (e.g., real 
numbers) in the consequent parts are generated 
by the heuristic method proposed in Nozzaki et 
al. (1997). In this chapter, we adjust, 
extend, and implement this function approximation 
method in order to produce an effective, 
working data mining tool in the field of pattern 
classification. A novel embedded “adaptive” 
process is also introduced, developed, and incorporated 
into the previous mechanism for automatically 
deriving highly accurate linguistic if-then rules. 
The main advantage of these fuzzy if-then rules 
is the simplicity of the fuzzy reasoning procedure 
because no defuzzification step is required. The 
heuristic method determines the consequent real 
number of each fuzzy if-then rule as the weighted 
mean value of given numerical data. Thus, the 
proposed heuristic method requires neither 
time-consuming iterative learning procedures nor 
complicated rule generation mechanisms. 

Design & Architecture of the 
Fuzzy Rule-Based System 

Fuzzy rule-based systems are also known as fuzzy 
inference systems, fuzzy models, fuzzy associa- 
tive memories, or fuzzy controllers. Basically, 
such fuzzy rule-based systems are composed 
of four principal components: a fuzzification 
interface, a knowledge base, a decision-making 
logic, and a defuzzification interface. Fuzzy Miner 
employs the architecture depicted in Figure 1. 

In Nozzaki et al. (1997), the authors consider a 
single-output fuzzy method in the n-dimensional 
input space $[0, 1]^n$, and we keep 
these assumptions for the moment for simplicity. The 
actual algorithm implemented in Fuzzy Miner 
introduces a multiple-output fuzzy rule-based 
system with an optional task: mapping the 
input spaces to the $[0, 1]^n$ space (the normalization 
process). When the normalization process 
is selected, an appropriate action is performed 
after the end of the algorithm to map 
the normalized data back to their original spaces. 
Let us assume that the following m input-output 
pairs are given as training data for constructing 
a fuzzy rule-based system: 

$$\{(x_p; y_p) \mid p = 1, 2, \ldots, m\} \qquad (4.1)$$ 

where $x_p = (x_{p1}, x_{p2}, \ldots, x_{pn})$ is the input vector of 
the $p$th input-output pair and $y_p$ is the corresponding 
output. 
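The optional normalization step and its reverse mapping can be sketched as follows. Min-max scaling is assumed here because it maps each variable onto [0, 1]; the text does not state which mapping Fuzzy Miner actually uses, and the function names are hypothetical.

```python
def fit_min_max(column):
    """Return the (minimum, maximum) of one input or output variable."""
    return min(column), max(column)


def normalize(value: float, lo: float, hi: float) -> float:
    """Map a value from [lo, hi] onto [0, 1]; a constant column maps to 0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def denormalize(unit_value: float, lo: float, hi: float) -> float:
    """Map a value from [0, 1] back to the original interval [lo, hi]."""
    return lo + unit_value * (hi - lo)


lo, hi = fit_min_max([10.0, 20.0, 30.0])
scaled = normalize(20.0, lo, hi)        # 0.5
restored = denormalize(scaled, lo, hi)  # 20.0
print(scaled, restored)
```

The bounds would be fitted per variable on the training pairs of (4.1) and reused unchanged when inferred outputs are mapped back.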

The fuzzification interface performs a mapping 
that converts crisp values of input variables into 
fuzzy singletons. Basically, a fuzzy singleton is 
a precise value, and hence, no fuzziness is intro- 
duced by fuzzification in this case. This strategy, 
however, has been widely used in fuzzy system ap- 
plications because it is easily implemented. Here, 
we employ fuzzy singletons in the fuzzification 
interface. On the other end, the defuzzification 
interface performs a mapping from the fuzzy 
output of a fuzzy rule-based system to a crisp 
output. However, the fuzzy rule-based system 
employed in this chapter does not require a de- 
fuzzification interface. 

In the following subsections, we present in 
detail the two core modules of the architecture; 
namely, the knowledge base and the decision 
making logic. 


Figure 1. Architecture of Fuzzy Miner 
Knowledge Base 

The knowledge base of a fuzzy rule-based 
system consists of two components — a database 
and a rule base. 

Database: There are two factors that determine a 
database: a fuzzy partition of the input space and 
membership functions of antecedent fuzzy sets. 
In order to develop the appropriate infrastructure, 
Fuzzy Miner defines three corresponding para- 
metric components — Database, Fuzzy Partition, 
and Membership Function. Database provides a 
complete set of functionalities upon the data (e.g., 
normalization/denormalization process) that the 
algorithm needs in order to operate effectively. 
One can think of a Database as the realization 
of a real database, which enables us to store, 
retrieve, update, and generally manipulate data. 
The Database component is defined as a 2D ar- 
ray, where the first dimension corresponds to the 
row of a database table, and the second dimension 
corresponds to the column (input-output space). 

We assume that the domain interval of the $i$th 
input variable $x_i$ is evenly divided into $K_i$ fuzzy 
sets labelled $A_{i1}, A_{i2}, \ldots, A_{iK_i}$ for $i = 1, 2, \ldots, n$. 
Then the n-dimensional input space is divided 
into $K_1 K_2 \cdots K_n$ fuzzy subspaces: 

$$(A_{1j_1}, A_{2j_2}, \ldots, A_{nj_n}), \quad j_1 = 1, 2, \ldots, K_1; \; \ldots; \; j_n = 1, 2, \ldots, K_n \qquad (4.2)$$ 

For example, in the case of a two-dimensional 
input space, the fuzzy subspace $(A_{1j_1}, A_{2j_2})$ corresponds 
to the region shown in Figure 2(a). Figure 
2(b) shows an example of the fuzzy partition 
for $K_1 = 5$ and $K_2 = 5$ in the case of a two-input 
single-output fuzzy rule-based system. 

The Membership Function component can be 
perceived as a means of measuring the degree of 
compatibility of a data value with a fuzzy set, or, 
informally, the extent to which this data value “belongs” to 
a fuzzy set. In order to be able to use more than 
one membership function, we adopt a generic 
representation that enables the definition of different 
kinds of membership functions. As such, 
the user of the fuzzy classifier can use not only 
triangular membership functions, but also trapezoidal 
and bell-shaped ones. In order to represent 
a triangular fuzzy membership function, three 
parameters are enough. However, from a practical 
point of view, to use trapezoidal and/or bell-shaped 
(Gaussian) membership functions, four 
parameters are necessary. Figure 3 depicts the 
three types of membership functions that Fuzzy 
Miner supports. 


Figure 2. (a) Fuzzy subspace and (b) fuzzy partition for K1 = 5 and K2 = 5 


Figure 3. (a) Triangular, (b) Trapezoidal, (c) Bell-shaped 


Figure 4. Fuzzy partition structure 


The Fuzzy Partition component supports the 
notion that input and output spaces should be 
partitioned into a sequence of fuzzy sets. Each of 
these fuzzy sets has a description of its membership 
function. Normally, there should be one Fuzzy 
Partition object per input and output space, but 
for simplicity, we make the assumption 
that the Fuzzy Partition represents all the fuzzy 
partitions. We further assume that all the fuzzy 
partitions are composed of the same number of 
fuzzy sets N. As such, Fuzzy Partition is a 2-D 
array of Membership Functions (see Figure 4). 
The first dimension corresponds to the input space 
number, and the second dimension corresponds 
to the fuzzy set number. Note that it is necessary 
to use a different fuzzy partition for each input 


space, because the domain intervals of the input 
variables may be different. 

The main functionality Fuzzy Partition offers 
Fuzzy Miner is the actual fuzzy partitioning tak- 
ing place at the time of its initialization. More 
precisely, in order to create Fuzzy Partition, the 
domain intervals of the input and output variables 
are needed. The domain interval of a variable $x_i$ 
is taken as $[x_i^{\min}, x_i^{\max}]$, where $x_i^{\min}$ and $x_i^{\max}$ are 
the minimum and maximum of the variable in 
the training data set. Note that the training data 
set is considered, not the testing data set. This 
approach guarantees a minimum number of 
unpredicted outputs. Furthermore, although the 
fuzzy partition of an input space is only supposed 
to cover the domain interval of the input variable, 
Figure 5. Fuzzy partitioning for triangular membership function 



the case of input values lying outside the domain 
interval must be taken into account. By assigning 
the value $-\infty$ to the first two parameters of the 
first fuzzy set and the value $+\infty$ to the last two 
parameters of the last fuzzy set, the fuzzy partition 
corresponding to an input variable $x_i$ covers 
$\mathbb{R}$. In Figure 5 we present the partitioning in the 
case of a triangular membership function. 
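The partitioning with open shoulders can be sketched using the four-parameter representation mentioned earlier, in which a triangle is a trapezoid whose two middle parameters coincide. Placing the peaks evenly with the first and last peak on the domain bounds is an assumption of this sketch, as are the function names.

```python
import math


def trapezoid(x, a, b, c, d):
    """Rises on [a, b], equals 1 on [b, c], falls on [c, d].

    Infinite a, b (or c, d) turn the left (or right) side into an open shoulder.
    """
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


def triangular_partition(lo, hi, k):
    """Return k parameter tuples (a, b, c, d) covering the whole real line."""
    if k < 2 or hi <= lo:
        raise ValueError("need k >= 2 and hi > lo")
    step = (hi - lo) / (k - 1)
    sets = []
    for i in range(k):
        peak = lo + i * step
        params = [peak - step, peak, peak, peak + step]
        if i == 0:
            params[0] = params[1] = -math.inf   # first two parameters of the first set
        if i == k - 1:
            params[2] = params[3] = math.inf    # last two parameters of the last set
        sets.append(tuple(params))
    return sets


partition = triangular_partition(0.0, 1.0, 5)
print(trapezoid(-10.0, *partition[0]))   # 1.0: values below the domain fall in the first set
print(trapezoid(10.0, *partition[-1]))   # 1.0: values above the domain fall in the last set
```

With this layout every real input has a non-zero membership in at least one fuzzy set, so inputs outside the training range still match some antecedent.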

Rule Base: The rule base consists of a set of fuzzy 
if-then rules in the form of “IF a set of conditions 
is satisfied, THEN a set of consequences can be inferred.” 
We assume that the rule base is composed 
of fuzzy if-then rules of the following form: 

Rule $R_{j_1 \ldots j_n}$: 

If $x_1$ is $A_{1j_1}$ and . . . and $x_n$ is $A_{nj_n}$ then 
$y$ is $b_{j_1 \ldots j_n}$, $j_1 = 1, 2, \ldots, K_1$; . . .; $j_n = 1, 2, \ldots, K_n$ 

(4.3) 

where $R_{j_1 \ldots j_n}$ is the label of each fuzzy if-then rule 
and $b_{j_1 \ldots j_n}$ is the consequent real number. These 
fuzzy if-then rules are referred to as simplified 
fuzzy if-then rules and have been used in Ichihashi 
and Watanabe (1990) and Nomura et al. (1992). 
For determining the consequent real number $b_{j_1 \ldots j_n}$ 
of the fuzzy if-then rule $R_{j_1 \ldots j_n}$ in (4.3), let us define 
the weight of the $p$th input-output pair $(x_p; y_p)$ as 

$$W_{j_1 \ldots j_n}(x_p) = \{\mu_{j_1 \ldots j_n}(x_p)\}^{\alpha} \qquad (4.4)$$ 

where $\mu_{j_1 \ldots j_n}(x_p)$ is the degree of compatibility of 
$x_p$ with the antecedent of the rule, as given in (4.10), 
and $\alpha$ is a positive constant. The role of the positive 
constant $\alpha$ will be demonstrated by computer 
simulations. Using the weight $W_{j_1 \ldots j_n}(x_p)$ of each 
input-output pair, the following heuristic method 
(the weighted mean value of the $y_p$’s) determines the 
consequent real number: 

$$b_{j_1 \ldots j_n} = \frac{\sum_{p=1}^{m} W_{j_1 \ldots j_n}(x_p) \, y_p}{\sum_{p=1}^{m} W_{j_1 \ldots j_n}(x_p)} \qquad (4.5)$$ 
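The weighted-mean heuristic of (4.4) and (4.5) can be sketched directly. This sketch assumes inputs already normalized to [0, 1], K evenly spaced triangular fuzzy sets per input, and 0-based fuzzy set indices; it evaluates (4.5) literally over all training pairs for every rule, which is simpler than the grouped procedure of Figure 6. The helper names are hypothetical.

```python
from itertools import product
from math import prod


def tri(x, peak, width):
    """Triangular membership with the given peak and half-width."""
    return max(0.0, 1.0 - abs(x - peak) / width)


def compatibility(x, rule, k):
    """Product of the memberships of each input in the rule's antecedent sets."""
    width = 1.0 / (k - 1)
    return prod(tri(xi, j * width, width) for xi, j in zip(x, rule))


def generate_rules(data, n_inputs, k, alpha=1.0):
    """Return {antecedent index tuple: consequent real number}."""
    rules = {}
    for rule in product(range(k), repeat=n_inputs):
        numerator = denominator = 0.0
        for x, y in data:
            weight = compatibility(x, rule, k) ** alpha   # eq. (4.4)
            numerator += weight * y
            denominator += weight
        if denominator > 0:                               # skip rules no pair supports
            rules[rule] = numerator / denominator         # eq. (4.5)
    return rules


training = [((0.0,), 0.0), ((0.5,), 1.0), ((1.0,), 0.0)]
print(generate_rules(training, n_inputs=1, k=3))
```

Larger values of the constant shrink the influence of pairs that match a rule only weakly, so each consequent moves toward the outputs of the best-matching pairs.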

Rulebase is the main component of the applica- 
tion and supports all the functionality that we need 
in order to implement the various aspects of Fuzzy 
Miner. It generates fuzzy rules from training data 
and furthermore is responsible for the decision- 
making part of the algorithm (see section 4.1.2). An 
additional task supported by our rule generation 
method is that of an adaptive procedure, which 
expands a given rulebase during the processing of 
testing data when the inference engine (decision 
making) of the algorithm is running. A Rulebase 
is implemented mainly as an array of Rules, each of which 
in turn is represented as an array of integers 
corresponding to the conditional part and an 
array of Then Part objects corresponding to the 
consequent part, one element per output space. 
Then Part objects are needed in order to calculate 
the consequent parts of a fuzzy rule (the relatively 
complex fraction, numerator over denominator, of 
equation 4.5). The computational realization 
of the mathematically described process above 
for inferring fuzzy rules, given learning 
data and the number of 
inputs and outputs of these data, is presented in 
Figure 6. 

Adaptive Procedure. Before illustrating how 
the decision-making method has been developed, 
we further introduce an adaptive procedure with 
which we are capable of refining an existing rule 
base during its application upon a specific clas- 
sification scenario. This procedure takes place 
concurrently with the decision-making process; 
namely, when testing data are examined, inferred 
outputs are calculated and mapped to classes. The 
idea is based on an advantage of the fuzzy-nu- 
merical methods, which is the facility to modify 
a fuzzy rulebase as new data become available. 
More specifically, when a new data pair becomes 
available, one rule is created for this data pair 


and is either added to the rule base or updated, 
if a similar rule (same conditional part) exists in 
the rule base. By this action, the consequent part 
of the rule is refined by applying the generation 
method once more for this specific conditional 
part. Thus, the “adaptive” procedure enhances 
Fuzzy Miner with incremental characteristics. 
All available information is exploited, improving 
decision making on testing data. 
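The incremental character of the adaptive procedure can be sketched by keeping the running numerator and denominator of equation 4.5 for each rule, so that a new data pair either creates a rule or refines an existing one with the same conditional part. The class and function names, the dictionary-based rule base, and the way the conditional part and weight are supplied are assumptions of this sketch.

```python
class ThenPart:
    """Running numerator and denominator of the weighted mean for one output."""

    def __init__(self):
        self.numerator = 0.0
        self.denominator = 0.0

    def update(self, weight, y):
        self.numerator += weight * y
        self.denominator += weight

    @property
    def consequent(self):
        return self.numerator / self.denominator


def adapt(rulebase, if_part, weight, y):
    """Add a rule for if_part, or refine the existing rule with the new pair."""
    if weight <= 0:
        raise ValueError("a data pair must match its own rule with positive weight")
    then_part = rulebase.setdefault(if_part, ThenPart())
    then_part.update(weight, y)
    return then_part.consequent


rulebase = {}
print(adapt(rulebase, (1, 2), 1.0, 4.0))  # new rule, consequent 4.0
print(adapt(rulebase, (1, 2), 1.0, 2.0))  # same conditional part, refined to 3.0
```

Because only two sums are stored per rule and output, refining a rule does not require revisiting earlier data pairs.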

Linguistic Representation. In real-world applica- 
tions, it may be desirable that linguistic rules are 
generated from numerical data. In Sugeno and Ya- 
sukawa (1993), an approach for deriving linguistic 
rules from fuzzy if-then rules with fuzzy sets in 
the consequent parts is proposed. Here, a similar 
approach is adopted for translating fuzzy if-then 


Figure 6. Rule Generation Method 


GenerateRules(numberOfInputs, numberOfOutputs, startOfLearnData, endOfLearnData) 
{ 
    currentRule = 0; 
    allocate memory for rulebase[currentRule]; 
    for all learning data pairs f 
        usedData[f] = false; 
    create temporary rule; 

    for (i = startOfLearnData; i <= endOfLearnData; i++) 
    { 
        if (!usedData[i]) 
        { 
            construct rulebase[currentRule]; 
            set IF part of rulebase[currentRule]; 
            set THEN part of rulebase[currentRule]; 
            calculate weight of rulebase[currentRule]; 
            for all outputs j 
                numerator[j] = weight * (THEN part of rulebase[currentRule]); 
                denominator[j] = weight; 
            for (j = i + 1; j <= endOfLearnData; j++) 
            { 
                set IF part of temporary rule; 
                set THEN part of temporary rule; 
                calculate weight of temporary rule; 

                if (!usedData[j] & currentRule has same IF part as temprule) 
                { 
                    for all outputs k 
                        update numerator[k]; 
                        update denominator[k]; 
                    usedData[j] = true; 
                } 
            } 

            for all outputs 
                set THEN part of rulebase[currentRule]; 
            currentRule++; 
        } 
    } 
} 
rules with consequent real numbers into linguistic rules. 
The “then” part of such rules is a linguistic label that 
corresponds to the classification of the respective 
data pairs. This approach can derive classification 
rules from fuzzy if-then rules with consequent 
real numbers, which may be generated by other 
rule generation methods as well as by the described 
heuristic method. Let us assume that the fuzzy if-then 
rules in (4.3) are given. To translate consequent 
real numbers into linguistic labels, suppose that 
the domain interval of an output $y$ is divided into 
$N$ fuzzy sets (i.e., linguistic labels) $B_1, B_2, \ldots, B_N$, 
which are associated with the membership functions 
$\mu_{B_1}, \ldots, \mu_{B_N}$, respectively. For example, these 
fuzzy sets may have linguistic labels such as S: 
small; MS: medium small; M: medium; ML: 
medium large; and L: large. In this method, the 
given fuzzy if-then rules in (4.3) are transformed 
into the following fuzzy if-then rules: 

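Rule $R^*_{j_1 \ldots j_n}$: 

If $x_1$ is $A_{1j_1}$ and . . . and $x_n$ is $A_{nj_n}$ then 
$y$ is $B^*_{j_1 \ldots j_n}$ with $CF^*_{j_1 \ldots j_n}$, 
$j_1 = 1, 2, \ldots, K_1$; . . .; $j_n = 1, 2, \ldots, K_n$ 

(4.6) 

where $B^*_{j_1 \ldots j_n}$ is the consequent fuzzy set characterized 
by the following membership function: 

$$\mu_{B^*_{j_1 \ldots j_n}}(b_{j_1 \ldots j_n}) = \max\{\mu_{B_i}(b_{j_1 \ldots j_n}) \mid i = 1, 2, \ldots, N\} \qquad (4.7)$$ 

and $CF^*_{j_1 \ldots j_n}$ is the degree of certainty defined as 

(4.8) 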
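The translation of a consequent real number into a linguistic label, as in (4.7), can be sketched by picking the output fuzzy set in which the number has the highest membership. The five labels follow the example in the text; their even triangular layout over an output domain of [0, 1] and the function names are assumptions. The function returns the maximum membership grade alongside the label and makes no assumption about how the degree of certainty in (4.8) is computed.

```python
LABELS = ("S", "MS", "M", "ML", "L")


def tri(x, peak, width):
    """Triangular membership with the given peak and half-width."""
    return max(0.0, 1.0 - abs(x - peak) / width)


def linguistic_label(b, lo=0.0, hi=1.0, labels=LABELS):
    """Return (label, grade) of the output fuzzy set that best matches b."""
    width = (hi - lo) / (len(labels) - 1)
    grades = [tri(b, lo + i * width, width) for i in range(len(labels))]
    best = max(range(len(labels)), key=grades.__getitem__)
    return labels[best], grades[best]


print(linguistic_label(0.5))  # ('M', 1.0)
print(linguistic_label(0.3))  # 'MS' with a grade of about 0.8
```

A consequent exactly halfway between two peaks has equal membership in both sets; this sketch then keeps the first of the two labels.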

Decision-Making Logic 


The decision-making logic is the kernel of a fuzzy 
rule-based system that employs fuzzy if-then rules 
from the rule base to infer the output by a fuzzy 
reasoning method. In this chapter, we employ the 
fuzzy reasoning method of equation 4.9 to calcu- 
late the inferred output of the fuzzy rule-based 


system. Given an input vector x p = (x , x p2 ,. . ., x p J, 
the inferred output y(x p ) is defined by: 

y( x P ) 

K„ / K l K„ 

= /-■/.. / Z-Z. U /,-L ( x r) 

Ji =1 Jn=l / Ji =1 Jn =1 

(4.9) 

where $\mu_{j_1 \ldots j_n}(x_p)$ is the degree of compatibility of the input vector $x_p = (x_{p1}, x_{p2}, \ldots, x_{pn})$ to the fuzzy if-then rule $R^*_{j_1 \ldots j_n}$ in (4.6), which is given by:

$\mu_{j_1 \ldots j_n}(x_p) = \mu_{1j_1}(x_{p1}) \times \cdots \times \mu_{nj_n}(x_{pn})$. (4.10)

From (4.9), we can see that the inferred output $y(x_p)$ is the weighted average of the consequent real numbers $b_{j_1 \ldots j_n}$ of the $K_1 K_2 \cdots K_n$ fuzzy if-then rules.
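The weighted-average reasoning of (4.9) and (4.10) can be sketched in a few lines. The data layout (a list of triangular antecedent sets per input and a dictionary from index tuples to consequents) and the function names are assumptions made for illustration; they are not taken from Fuzzy Miner itself.

```python
def triangular(x, left, peak, right):
    """Membership of x in a triangular fuzzy set with the given corners."""
    if x == peak:
        return 1.0
    if left < x < peak:
        return (x - left) / (peak - left)
    if peak < x < right:
        return (right - x) / (right - peak)
    return 0.0


def infer(x, antecedents, consequents):
    """Inferred output y(x) as the compatibility-weighted average of consequents.

    antecedents[i][j] holds the corners of fuzzy set j of input i;
    consequents maps an index tuple (j_1, ..., j_n) to the real number b.
    Returns None when no rule is compatible with x (an unpredicted output).
    """
    numerator = 0.0
    denominator = 0.0
    for index, b in consequents.items():
        compatibility = 1.0
        for i, j in enumerate(index):
            # Product of the per-input memberships, as in (4.10).
            compatibility *= triangular(x[i], *antecedents[i][j])
        numerator += compatibility * b
        denominator += compatibility
    if denominator == 0.0:
        return None
    return numerator / denominator


# Two inputs on [0, 1], each partitioned into two triangular sets.
sets = [(-1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
antecedents = [sets, sets]
# Hypothetical consequents that reproduce y = x1 + x2 at the grid corners.
consequents = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

print(infer((0.25, 0.5), antecedents, consequents))
```

Because the two sets of each input sum to one everywhere on [0, 1], the denominator equals one in this example and the inferred output interpolates the consequents linearly.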

Given a testing data set, this method calculates the outputs of the Fuzzy Miner and maps each inferred consequent real number to the fuzzy set (classification result) to which this real number belongs. Subsequently, the method stores both the original outputs and classifications of the testing data pairs and the inferred outputs with the resulting classifications in an output database. In order to improve results, we utilize the adaptive procedure, which is an embedded process and not an autonomous one. Finally, in order to evaluate the algorithm for the given testing data, the decision-making method estimates the squared errors between the desired output $y_p$ and the inferred output $y(x_p)$. This Performance Index (PI) (eq. 5.2) and the number of unpredicted results are returned as the output of the whole process. This procedure is illustrated in Figure 7.

EVALUATION OF THE FUZZY 
MINER 

In this section, we focus on investigating the 
reliability and the validity of Fuzzy Miner. The 




Fuzzy Miner: Extracting Fuzzy Rules from Numerical Patterns 


Figure 7. Algorithm for the Decision Making Engine 


Decision Making ()
{
    construct & initialize an output database DB;
    construct & initialize one numerator & denominator per output;
    unpredictedResults = 0;
    create temporary rule;

    for (i = endOfLearning + 1; i <= endOfTesting; i++)
    {
        set IF part of temporary rule;
        set THEN part of temporary rule;
        calculate weight of temporary rule;
        re-initialize numerators & denominators per output;
        ruleFound = FALSE;

        for (k = 0; k <= numOfRules; k++)
        {
            estimate degree of compatibility of temporary rule with rulebase[k];

            if (degree > 0)
                for all outputs
                    update numerators & denominators of temprule (using degree);
        }

        if (!ruleFound & temprule has same IF part as currentRule)
        {
            ruleFound = TRUE;
            for all outputs
                update numerators & denominators of currentRule (using weight);
        }

        for all outputs
        {
            write to output DB    // performing reverse mapping
                1. original output
                2. initial classification
                3. inferred output
                4. resulted classification

            if (denominators != 0)
                calculate performance index;
            else
                unpredictedResults++;
        }

        if (!ruleFound)
            rulebase[++numOfRules] = temporary rule;

        for all rules
            perform reverse mapping of inferred output to classes &
            set linguistic label for current rule;              // classification result
            set probability of correctness for current rule;    // degree of certainty
    }

    if (writeToDB)
    {
        write inferred data to database;
        if (denormalize)
            if (normalize)    // normalization was performed on load data
                denormalize output data;
    }
}


process of classification is deterministic, mean- 
ing that the same input data will always produce 
the same output. As such, in order to measure 
the performance of the methods implemented in 
Fuzzy Miner, we have run several experiments 
that result in interesting conclusions. These ex- 
periments have also been compared to the results 
derived from another classifier, which uses a 
neuro-fuzzy approach (Nauck & Kruse, 1995). 
For the experiments, we used the data set from the 
Athens Stock Exchange (ASE, 2004) market. 


The ASE data set keeps a vast amount of information concerning the daily transactions of the stock market of Greece. As already mentioned, the algorithm works with numerical data, and fuzzy systems are universal approximators of any real continuous function. In order to take advantage of this important feature of fuzzy systems and for the purposes of the evaluation, we designed a classification task based upon the prediction/inference of a function that estimates a real number, which represents the degree of fluctuation of a stock price during a day. The calculation of this real number, as the result of a function, is based upon the following information that the ASE database stores:

• Max price: the maximum point in the fluc- 
tuation of the price of a stock during a day 

• Min price: the respective minimum point 
in the above-mentioned fluctuation 

• Exchanged items: the number of stocks that 
were sold/bought during a day 

• Close price: the ending point in the daily 
fluctuation of the price of the stock 

Our choice for such a function is a formula 
that takes into account three factors and estimates 
a real number in the interval [0, 3]. More specifi- 
cally, each of these factors calculates a normalized 
number from zero to one, indicating how much a 
specific stock fluctuated during a day. Based on 
these sub-formulas, the overall indication of the 
fluctuation of the stock is derived. Therefore, the 
function that estimates the consequent (output) real 
number, which Fuzzy Miner will try to infer/ap- 
proximate, is described in equation 5.1: 

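$f : \mathbb{R}^4 \to [0, 3]$ and $f(x_1, x_2, x_3, x_4) = \text{factor}_1 + \text{factor}_2 + \text{factor}_3$ (5.1)

$f(x_1, x_2, x_3, x_4) = \dfrac{(\text{Maxprice} - \text{Minprice}) - \text{MINdiff}}{\text{MAXdiff} - \text{MINdiff}} + \dfrac{(\text{Closeprice} - \text{Minprice}) - \text{MINdiff}}{\text{MAXdiff} - \text{MINdiff}} + \dfrac{\text{ExchangedItems} - \text{MINitems}}{\text{MAXitems} - \text{MINitems}}$

where $x_1, x_2, x_3, x_4$ are the four input parameters of the data set.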

The first factor indicates the fluctuation of the 
difference between the maximum price and the 


minimum price of a stock. MAXdiff and MINdiff 
are the maximum and minimum (respectively) 
differences in the data set between Max price and 
Min price attributes. The second factor models the 
fluctuation of the difference between the closing 
price and the minimum price of a stock. This is 
an indication of how easily a stock maintains its 
price away from the minimum price. MAXdiff 
and MINdiff are the respective maximum and 
minimum differences between Close and Min 
price columns. Finally, the third factor tries to 
strengthen the two previous factors by adding to 
them the normalized number of exchanged items. 
By this, we model the fact that if the fluctuation 
of a stock is high, then this is more important 
when the number of exchanged items is also high 
than when the number of stocks that were sold or 
bought is low. MAXitems and MINitems are the 
maximum and minimum values of Exchanged 
items attribute. 
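As a rough illustration of equation 5.1, the three normalized factors can be computed as below. The record layout and the field names (`max_price`, `min_price`, `close_price`, `exchanged_items`) are hypothetical, and the sketch assumes that each quantity actually varies over the data set, so that no denominator is zero.

```python
def normalise(value, low, high):
    """Min-max normalisation of value into [0, 1]."""
    return (value - low) / (high - low)


def fluctuation(records):
    """Return f = factor1 + factor2 + factor3 for every record.

    MAXdiff/MINdiff are taken separately for the (max - min) differences and
    for the (close - min) differences, as described in the text.
    """
    ranges = [r["max_price"] - r["min_price"] for r in records]
    closes = [r["close_price"] - r["min_price"] for r in records]
    items = [r["exchanged_items"] for r in records]
    return [
        normalise(d, min(ranges), max(ranges))
        + normalise(c, min(closes), max(closes))
        + normalise(n, min(items), max(items))
        for d, c, n in zip(ranges, closes, items)
    ]


# Made-up sample records, for illustration only.
records = [
    {"max_price": 10.0, "min_price": 8.0, "close_price": 9.0, "exchanged_items": 100},
    {"max_price": 12.0, "min_price": 6.0, "close_price": 12.0, "exchanged_items": 500},
    {"max_price": 5.0, "min_price": 4.0, "close_price": 4.0, "exchanged_items": 300},
]
values = fluctuation(records)
print([round(v, 4) for v in values])
```

The second record attains the maximum of every factor and therefore reaches the upper bound of three, while the third record only scores on the exchanged-items factor.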

Experimenting with Fuzzy Miner 

In order to assess the forecasting ability of Fuzzy Miner, we limited our experiments to the banking sector, from which we sampled 3,000 input-output data pairs collected from the daily transactions of eight banking institutions during a calendar year (1997). A set of 1,500 tuples was used for learning and the remaining 1,500 tuples for testing. Note that since the fuzzy rule-based system can employ the “adaptive” procedure, test data may be learning data as well, although they do not participate in the creation of the initial fuzzy rule base.

Fitting & Generalization Ability for Training and Testing Data

For the evaluation of the algorithm, the summation of square errors between the desired output $y_p$ and the inferred output $y(x_p)$ for each input-output pair $(x_p; y_p)$ is calculated. This performance index (PI) for Fuzzy Miner is given by equation 5.2:

$PI = \sum_{p=1}^{m} \{y(x_p) - y_p\}^2 / 2$ (5.2)
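A small sketch of how the performance index might be computed alongside the count of unpredicted results follows. Representing an unpredicted output as `None` and leaving it out of the sum are assumptions of this sketch, chosen to mirror the "calculate performance index, else count as unpredicted" branch of Figure 7.

```python
def performance_index(desired, inferred):
    """Return (PI, unpredicted) for paired desired and inferred outputs.

    PI is half the sum of squared errors over the pairs that could be
    predicted; pairs whose inferred output is None are only counted.
    """
    total = 0.0
    unpredicted = 0
    for y, y_hat in zip(desired, inferred):
        if y_hat is None:
            unpredicted += 1
        else:
            total += (y_hat - y) ** 2
    return total / 2.0, unpredicted


pi, missing = performance_index([1.0, 2.0, 0.5], [1.5, None, 0.5])
print(pi, missing)
```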

The two most important factors of the fuzzy rule-based system are the value of α and the size of the fuzzy partitions. In order to understand the influence of these parameters on the performance index, the algorithm has been invoked with different values of α, varying from 0.1 to 50, and a fuzzy partition size varying from two to 25. The results of the simulations are presented in Table 1, which contains only a subset of the experiments. Note that for each value of α, we first present PI without using the adaptive approach, and subsequently PI using the adaptive procedure.

An obvious conclusion that can be drawn from Table 1 is that larger fuzzy partition sizes lead to a better fit (smaller PI) to the given input-output data pairs. Using the table containing the complete results of the previously mentioned simulations, the PI for both the original method and the method using the adaptive approach have been plotted against the number of fuzzy sets per fuzzy partition. Figure 8 indicates the strength of the adaptive approach. However, when α > 1, PI might become worse due to the phenomenon of overfitting. Further investigation indicates that when the number of fuzzy sets is high, PI decreases very slowly, whereas for a small number of fuzzy sets, PI is much more sensitive to the variation of the fuzzy partition size. Finally, PI asymptotically converges to a specific limit as the fuzzy partition size increases.

An additional observation is that, for each specific number of fuzzy sets, PI decreases or increases depending on the value of α. More specifically, when α is less than five, PI improves (becomes smaller) as α grows, but when α exceeds that limit, PI starts to deteriorate. The best fitting is obtained when α = 5. As a conclusion, we could argue that PI can be improved by choosing the appropriate value of α. Figure 9 illustrates the desired output and two inferred outputs for two different values of α. When α = 5, it is evident that the approximation of the formula is much better than when α is 0.1.

Classification Success 

In order to investigate the classification success of Fuzzy Miner with respect to different sizes of the fuzzy partitioning and the values of α, the following table is provided.


Table 1. Performance index against α & number of fuzzy sets

(For each α, the first row is PI without the adaptive approach and the second row is PI with it. A "?" marks a digit that is illegible in the source.)

                      Number of fuzzy sets
α      Adaptive   2          3          4          5          6          7
0.1    no         0.013042   0.005884   0.005064   0.003990   0.003728   0.003467
0.1    yes        0.012900   0.005830   0.005028   0.003912   0.003518   0.003380
0.5    no         0.011439   0.005519   0.004895   0.003954   0.003716   0.003464
0.5    yes        0.011340   0.005453   0.004866   0.003884   0.003477   0.003378
1      no         0.009960   0.005212   0.004?19   0.003915   0.003713   0.003467
1      yes        0.009905   0.005117   0.004702   0.003864   0.003435   0.003392
5      no         0.006252   0.004578   0.004198   0.003719   0.003249   0.003558
5      yes        0.00614?   0.004452   0.004268   0.003817   0.003157   0.003688
10     no         0.005421   0.004470   0.004292   0.003664   0.003294   0.003132
10     yes        0.005626   0.004452   0.004212   0.003767   0.004056   0.003988
50     no         0.005767   0.004848   0.004583   0.003972   0.00393?   0.004712
50     yes        0.005598   0.005090   0.004?89   0.004050   0.004989   0.005109

A conclusion that accords with the one drawn previously is that, when the number of fuzzy sets is fixed, for values of α lower than five the percentage of classification success increases as α approaches five. When α exceeds five, the trend is either to stabilize or to decrease. This conclusion does not hold as strongly as in the case of PI. This is reasonable due to the vagueness that is introduced by the fuzzy sets. There is the possibility that the PI of the classifier can be improved without a corresponding improvement of the classification success. Table 3 presents the trend of the classifier for the case of two fuzzy sets (classes).

The previous reason also explains why Table 2 includes cases where the percentage of success is the same with and without the adaptive approach. Except for those few situations where the two percentages are identical, the general trend is that, for a number of classes less than five, the adaptive approach gives higher classification results than the approach that does not use it. Unfortunately, for fuzzy partition sizes greater than five, the phenomenon of overfitting does not allow the adaptive procedure to improve performance. (By overfitting, we mean the situations where the updating of the consequent parts of the fuzzy rules should not be performed if a predefined performance index is reached.)


Figure 8. PI against size of fuzzy partitioning


Figure 9. Fluctuation against α (curves: desired output, α = 0.1, α = 5; x-axis: input patterns)



Table 2. Classification success against α & number of fuzzy sets


Table 3. Classification accuracy against α


Figure 10. Classification success against size of fuzzy partitioning 
Finally, the simplest but also the strongest inference that one can make from Table 2 is that increasing the number of classes decreases the success percentage of the classifier. This conclusion is illustrated in Figure 10, which shows that, at low fuzzy partition sizes, the classification success drops rapidly as the partition size grows. In contrast, for a high number of fuzzy sets, the rate of reduction stabilizes.

Optimizing the Size of the Rule Base

Another interesting observation that one might make by running the Fuzzy Miner with the previous parameters is the variation of PI with respect to the size of the rule base. More specifically, Figure 11 indicates that, as PI decreases, the number of produced fuzzy rules grows at a steady rate. Using this graphical representation, it is possible to determine the “optimum” number of rules with respect to a performance requirement. Then, assuming a linear relation between the number of rules and the fuzzy partition size, the “optimum” number of fuzzy sets can also be determined. The relation between the fuzzy partition size and the inferred fuzzy rules is shown in Figure 12.


Figure 11. Size of rule base against PI

The fact that the number of produced rules depends exponentially on the number of fuzzy sets used to partition the input space can also be inferred from Table 4, where the number of rules for different sizes of fuzzy partition is presented. Table 4 includes an extra row containing the number of rules when the adaptive approach is applied. As expected, the size of the rule base gets larger as new rules are added during the decision-making stage.

Figure 12. Rule base against fuzzy partition size


Table 4. Produced rules for different numbers of fuzzy sets


Selecting the Right Type of Membership Function

All of the above experiments were performed by selecting the trapezoidal membership function. The question naturally arises whether the other two types of membership function that Fuzzy Miner supports perform better on the classification task. In order to answer this question, the following two tables are provided, where each column presents the mean value of PI (Table 5) and the classification success (Table 6), respectively. The presented averages are taken over all possible fuzzy partition sizes and have been calculated for two values of α, where Fuzzy Miner presents relatively stable behaviour.


Table 5. Average performance index for different types of membership function


Table 6. Average classification success for different types of membership function

From these tables, we draw the conclusion that the lowest PI and the highest classification success are obtained when using the Gaussian membership function. The second best fitting is accomplished with the trapezoidal function. There is a logical explanation for the differences in the performance of these functions. First, the trapezoidal membership function is better than the triangular because the former gives the maximum degree of compatibility (which is one) for more attribute values than the latter, which gives this maximum membership value only for those values that correspond to the centroid of the triangular shape. As such, the trapezoidal membership function gives higher degrees of compatibility on average, so the approximation of the desired output becomes an easier task. Unfortunately, when using the trapezoidal membership function, some attribute values may be assigned the maximum degree of compatibility when they should be assigned lower degrees. This problem can be solved either by widening the big base or by narrowing the small base of the trapezoidal shape. Finally, the bell-shaped function has an improved behaviour as it exhibits a smoother transition between its various parts.


Table 7. Classification accuracy of the NEFCLASS model
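The qualitative difference between the three shapes can be made concrete with a short sketch. The parameterisation below (corner points for the triangle and trapezoid, centre and width for the Gaussian) and the sample values are illustrative assumptions and do not reproduce the settings used in the experiments.

```python
import math


def triangular(x, left, peak, right):
    """Full membership only at the single peak value."""
    if x == peak:
        return 1.0
    if left < x < peak:
        return (x - left) / (peak - left)
    if peak < x < right:
        return (right - x) / (right - peak)
    return 0.0


def trapezoidal(x, left, top_left, top_right, right):
    """Full membership on the whole small base [top_left, top_right]."""
    if top_left <= x <= top_right:
        return 1.0
    if left < x < top_left:
        return (x - left) / (top_left - left)
    if top_right < x < right:
        return (right - x) / (right - top_right)
    return 0.0


def gaussian(x, centre, sigma):
    """Bell-shaped membership with a smooth transition everywhere."""
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))


xs = [0.40, 0.45, 0.50, 0.55, 0.60]
full_tri = sum(1 for x in xs if triangular(x, 0.25, 0.5, 0.75) == 1.0)
full_trap = sum(1 for x in xs if trapezoidal(x, 0.25, 0.45, 0.55, 0.75) == 1.0)
print(full_tri, full_trap)
print([round(gaussian(x, 0.5, 0.1), 3) for x in xs])
```

Of the five sample values, only the centre receives full membership from the triangle, whereas the trapezoid assigns it to every value on its small base; narrowing that base moves the trapezoid towards the triangle.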

Missing Rules 

There is the possibility that Fuzzy Miner will 
not be able to predict an output for all input data 
pairs. This may occur if there is no rule in the 
rule base that corresponds to that input data pair. 
In the simulations performed, this problem oc- 
curred only for some specific parameter values, 
and particularly for large fuzzy partition sizes. 
The number of unpredicted outputs was very 
small (rarely more than two). Nevertheless, this 
is also an additional criterion that must be taken 
into account when trying to optimize a fuzzy 
rule-based system. 

Comparing Fuzzy Miner 
with NEFCLASS 

Continuing our evaluation, we have chosen NEFCLASS (Nauck & Kruse, 1995) as the method against which to compare our fuzzy system, as it has many features in common with Fuzzy Miner and shares the same goals. We have repeated on NEFCLASS experiments similar to those that were used to explore the validity of Fuzzy Miner. In these experiments, we are interested in the particular characteristics of NEFCLASS, and in this section, we present a summary of the results of the simulations performed. Table 7 reviews the experiments that took place.


Comparing the classification success of NEF- 
CLASS for its various parameters with the cor- 
responding percentages from the third column of 
Table 2 (number of fuzzy sets equal to three), we 
can have a clear picture of the cases that Fuzzy 
Miner performs better. In general, Fuzzy Miner 
gives higher classification performances than 
NEFCLASS. The only case where NEFCLASS 
classifies patterns with a higher rate is when it 
uses the cross validation procedure either autono- 
mously or in conjunction with one of the pruning 
strategies. In the second situation, the success of 
the classifier is even higher. The explanation for 
this is that the cross validation procedure uses the 
whole pattern set both as training and as testing 
data set, and performs several validation steps 
in order to create a classifier with the best fitting 
upon this specific data set. Another conclusion 
is that both NEFCLASS and Fuzzy Miner have 
slightly better classification results when using 
the Gaussian membership function. 

The main difference between NEFCLASS and Fuzzy Miner in the philosophy of generating fuzzy rules is not in the algorithmic part of the two approaches. Both algorithms utilize similar methods to produce the antecedent part of a rule, but they differ in the calculation of the consequent part of the rule, by employing different heuristic methods. The crucial difference that separates the two approaches is that NEFCLASS, by using the “best” rule and “best per class” rule learning methods, reduces the size of the rulebase significantly. In addition to this, one could decrease the number of produced rules even more by employing one of the supported pruning strategies. On the other hand, our approach searches the whole pattern space and produces rules so that every training input pattern is satisfied by some rule. This high number of rules is further increased by introducing the adaptive approach, which creates new rules when test data are processed. This major difference between NEFCLASS and Fuzzy Miner stems from the different philosophy and intention of the two applications. NEFCLASS concentrates on the conciseness and the readability of the fuzzy rules, while Fuzzy Miner tries to cover all possible input patterns. That is actually the main explanation why Fuzzy Miner outperforms NEFCLASS in the classification of unseen data.

There is a significant increase in the performance of NEFCLASS when using the method of training the fuzzy sets that were initially constructed by the rule generation method. By training the fuzzy sets, we mean that the base or the height of the initial membership functions is adjusted, so the fuzzy sets can partition the pattern space in a better way. This improved way of fuzzy partitioning means that the degree of membership of a specific input value to a fuzzy set becomes larger or smaller according to some user-defined criterion. Fuzzy Miner also supports a very simple way of tuning the membership functions, which is actually the α parameter. Beyond a certain point, this form of tuning the membership functions causes overfitting to the data, and the classification accuracy of the classifier worsens. Another problem with improving the shape of the membership functions is that the change that is performed is uniform, so it cannot take into account the uneven distribution that is common in real world data. The optimum solution is to try to capture this unevenness of the data set and transmit it to the shape of the membership functions (e.g., scalene triangles).

Both NEFCLASS and Fuzzy Miner employ linguistic representation methods to derive fuzzy if-then rules. The main difference between these rulebases is their size. NEFCLASS produces fewer and shorter rules, which are more readable and concise. On the other hand, Fuzzy Miner has the disadvantage that it may extract many rules for huge data sets, but it has the advantage that, by taking into account the whole pattern set, it can show important exceptions or trends in the data set. From the point of view of the developer, such rules may seem useless, but from the point of view of an expert, these rules may identify exceptional behaviours or problematic cases. As such, NEFCLASS is more likely to lose useful extracted knowledge by excluding whole rules or parts of a rule.

An additional issue that should be mentioned 
is that the generation of a rulebase in Fuzzy Miner 
is almost instantaneous (several seconds for large 
data sets). On the contrary, in order to improve the 
initially created classifier, NEFCLASS has to train
the fuzzy sets, prune the rule base, and repeat 
the decision-making stage. All these operations 
result in a more readable rulebase, but the whole 
process is time consuming. Furthermore, Fuzzy 
Miner has the advantage that it can classify more 
than one output concurrently, based on the same 
set of input attributes. In order to do this in NEF- 
CLASS, one has to create different input files for 
each output and to rebuild new classifiers from 
scratch. Last but not least, in Fuzzy Miner, several 
configurations can be tried, and the best one can 
be selected on the basis of the lowest test error. 
If the number of unpredicted outputs is not zero, 
a coarser fuzzy partition should be tried. On the 
other hand, this is not necessary in NEFCLASS, 
as all these are automatically performed by the 
cross validation procedure. 

CONCLUSION AND FUTURE WORK 

This chapter studies the pattern classification problem as it is presented in the context of data mining. More specifically, an efficient fuzzy approach for classification of numerical data is described, followed by the design and the implementation of its corresponding tool called Fuzzy Miner. The approach does not need a defuzzification process; it can be utilized as a function approximator and, with slight changes, can be used as a predictor rather than as a classifier. The framework is highly flexible in that its components are configurable to meet various classification objectives. Linguistic representation of the produced fuzzy rules makes the classifier interpretable by naive users, whereas the introduction of the adaptive procedure enables expanding and improving the rulebase while examining unseen testing patterns. Fuzzy Miner was evaluated using the Athens Stock Exchange (ASE, 2004) data set. The strategy adopted by Fuzzy Miner was shown to be successful, and the results of the created classifier were presented.

Additional future work is planned in various aspects of Fuzzy Miner. To start with, pruning strategies could be used to improve the interpretability of the classifier. These pruning strategies could be automatic, or some control could be given to the user over the pruning process. Secondly, we have already started designing an algorithm for training the initially created fuzzy sets by changing the length of the base or the height of a membership function, thus representing reality with greater precision. Additionally, in the adaptive procedure, we can refine or discard some of the new rules based on the indications of our pruning strategies. As such, it will not be necessary to execute the pruning module for the whole rulebase from scratch every time the adaptive approach is used to improve the classifier. Furthermore, a knowledge expert should be provided with the capability to initialize the rulebase externally, or to change existing rules that do not agree with the expert’s domain knowledge. We can further help the expert in the preprocessing stage by providing some statistics on the training data. Another idea is to attach to the system an algorithm to automatically determine the number of fuzzy sets for each variable and a clear criterion of how “good” the produced fuzzy sets are. A new user interface that supports graphical and textual displays (e.g., displays of the fuzzy sets) would be beneficial for interpreting the results of Fuzzy Miner (Kopanakis & Theodoulidis, 2003). Finally, a long-term goal is to integrate Fuzzy Miner with a neural network and to propagate the outcome to a genetic algorithm that would extract the optimum solution upon a specific classification task.


REFERENCES 

Aamodt, A., & Plazas, E. (1994). Case-based 
reasoning: Foundational issues, methodological 
variations, and system approaches. AI Comm., 
7, 39-52. 

ASE (2004). The Athens Stock Exchange closing 
prices. Retrieved March 26, 2004: http://www. 
ase.gr/content/en/MarketData/Stocks/Prices/de- 
fault.asp 

Berger, J.O. (1985). Statistical Decision Theory 
and Bayesian Analysis (second edition). Springer- 
Verlag. 

Bigus, J.P. (1996). Data Mining with Neural 
Networks: Solving Business Problems. McGraw- 
Hill. 

Chen, M.S., Han, J., & Yu, P.S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866-883.

Cios, K., Pedrycz, W., & Swiniarski, R. (1998). 
Data Mining Methods for Knowledge Discovery. 
Kluwer Academic Publishers. 

Craven, M.W., & Shavlik, J.W. (1997). Using neu- 
ral networks in data mining. Future Generation 
Computer Systems, 13, 211-229. 

Han, J., & Kamber, M. (2001). Data Mining: 
Concepts and Techniques. Morgan Kaufmann. 

Ichihashi, H., & Watanabe, T. (1990). Learning 
control system by a simplified fuzzy reasoning 
model. Proceedings of IPMU’90, (pp. 417-419). 

Kopanakis, I., & Theodoulidis, B. (2003). Visual 
data mining modeling techniques for the visual- 
ization of mining outcomes. Journal of Visual 
Languages and Computing, Special Issue on 
Visual Data Mining, 14(6), 543-589. 

Kosko, B. (1992). Fuzzy systems as universal approximators. Proceedings of FUZZ-IEEE ’92, (pp. 1153-1162).




Leake, D.B. (1996). CBR in context: The pres- 
ent and future. In D.B. Leake (Ed.), Case-Based 
Reasoning: Experience, Lessons and Future 
Directions (pp. 3-30). AAAI Press. 

Lenarcik, A., & Piasta, Z. (1997). Probabilistic 
rough classifiers with mixture of discrete and 
continuous variables. In T.Y. Lin & N. Cercone 
(Eds.), Rough Sets and Data Mining: Analysis for 
Imprecise Data (pp. 373-383). Kluwer Academic 
Publishers. 

Liu, B., Xia, Y., & Yu, P. (2000). Clustering 
through decision tree construction. Proceedings 
of ACM CIKM. 

Manoranjan, V.S., Lazaro, A. de Sam, Edwards, 
D., & Aathalye (1995). A systematic approach to 
obtaining fuzzy sets for control systems. IEEE 
Transactions on Systems, Man and Cybernetics, 
25(1). 

Masters, T. (1993). Practical neural network 
recipes in C++. Academic Press. 

Mitchel, M. (1996). An Introduction to genetic 
algorithms. MIT Press. 

Nauck, D., & Kruse, R. (1995). NEFCLASS—A 
neuro-fuzzy approach for the classification of 
data. ACM Press. 

Nomura, H., Hayashi, I., & Wakami, N. (1992). 
A learning method of fuzzy inference rules by 
descent method. In Proceedings of FUZZ-IEEE 
’92, (pp. 203-210). 

Nozzaki, K., Ishibuchi, H., & Tanaka, H. (1997). 
A simple but powerful heuristic method for gen- 
erating fuzzy rules from numerical data. Fuzzy 
Sets and Systems 86, 251-270. 

Oja, E. (1983). Subspace methods for pattern 
recognition. John Wiley. 


Pawlak, Z. (1991). Rough Sets, theoretical as- 
pects of reasoning about data. Kluwer Academic 
Publishers. 

Pelekis, N. (1999). Fuzzy Miner: A fuzzy system for solving pattern classification problems. M.Sc. Thesis, UMIST. Retrieved June 15, 2004: http://users.forthnet.gr/ath/pele/HOME_PAGE_NIKOS_PELEKIS/Download/

Prabhu, N. (2003). Gauge groups and data clas- 
sification. Applied mathematics and computation, 
138(2-3), 267-289. 

Schalkoff, R.J. (1992). Pattern recognition: Statistical, structural and neural approaches. John Wiley.

Sugeno, M., & Kang, G.T. (1998). Structure 
identification of fuzzy model. Fuzzy Sets and 
Systems, 28, 15-33. 

Sugeno, M., & Yasukawa, T. (1993). A fuzzy-logic-based approach to qualitative modeling. IEEE Trans. Fuzzy Systems, 1, 7-31.

Swiniarski, R. (1998). Rough sets and principal component analysis and their applications in feature extraction and selection, data model building and classification. In S. Pal & A. Skowron (Eds.), Fuzzy Sets, Rough Sets and Decision Making Processes. Springer-Verlag.

Takagi, H., & Hayashi, I. (1991). NN-driven fuzzy 
reasoning. Approximate reasoning, 5, 191-212. 

Wang, L.X. (1992). Fuzzy systems as universal approximators. In Proceedings of FUZZ-IEEE ’92, (pp. 1163-1170).

Wang, L.X., & Mendel, J.M. (1992). Generating 
fuzzy rules by learning from examples. IEEE 
Trans. Systems, Man Cybernet, 22, 1414-1427. 

Zimmermann, H.-J. (1996). Fuzzy set theory 
and its applications (third edition). Kluwer 
Academic. 




ENDNOTE 

* A short version appears in the informal Proceedings of the 1st International Workshop on Pattern Representation and Management (PaRMa’04), Heraklion-Crete, Greece, March 2004.
ABSTRACT

QOSPF (Quality of Service Open Shortest Path First) based on QoS routing has been recognized as a 
missing piece in the evolution of QoS-based services on the Internet. Data mining has emerged as a tool 
for data analysis, discovery of new information, and autonomous decision making. This article focuses 
on routing algorithms and their applications for computing QoS routes in OSPF protocol. The proposed 
approach is based on a data mining approach using rough set theory, for which the attribute-value system 
about links of networks is created from network topology. Rough set theory offers a knowledge discov- 
ery approach to extracting routing decisions from attribute set. The extracted rules then can be used to 
select significant routing attributes and to make routing selections in routers. A case study is conducted 
in order to demonstrate that rough set theory is effective in finding the most significant attribute set. 
It is shown that the algorithm based on data mining and rough set offers a promising approach to the 
attribute selection problem in Internet routing. 


INTRODUCTION 

With the development of high-level applications 
of data communication networks, routing in networks 
needs to satisfy QoS (Quality of Service) 
demands. Since finding optimal solutions to QoS 
routing is an NPC (nondeterministic polynomial 
time complete) problem, a node would not be 
able to maintain its network information base in a timely manner 
under constant changes of network states (Liu, Xu, 
Xu, & Cui, 2003; Chickering 1996). IP routing 
protocols long have been an essential element of 
internetworking. Recently, this technology has 
been in the focus of new development, as routing 
has evolved to handle the needs of next-genera- 
tion networks. Data connection provides a full 
range of high-function portable Unicast IP routing 
software products, including BGP, OSPF, and RIP 
for both IPv4 and IPv6 networks. Traditional IP 
routing protocols, therefore, have been extended 
substantially in a number of areas. 

In order to adapt to new demands of computer 
networks, it is necessary to introduce new feasible 
and efficient schemes. In order to find an optimal 
path with only one attribute metric such as band- 
width or number of hops, the traditional OSPF 
(Open Shortest Path First) uses the cost metric, 
which is an unsigned 16-bit integer in the range 
of 1 to 65,535. The default cost for interfaces is 
calculated, based on the bandwidth by the formula 
$10^8/BW$, with BW being the bandwidth of the interface
expressed as a full integer of bps (Bruno, 
2003). This mechanism may lower the utilization 
of network resource and cause load imbalance, 
and it cannot satisfy QoS requirements. The 
problem of QoS routing is challenging, because 
selecting paths that meet multiple QoS attribute 
constraints is a complex algorithmic problem. As 
current routing protocols already are reaching the 
limit of feasible complexity, it is important that 
the complexity introduced by QoS support should 
not impair the scalability of routing protocols 
(Li, Zheng, & Nahavandi, 2003). Therefore, the 
QOSPF based on IP QoS Routing is developed 
(Crawley, Argon Networks & Nair, 1998). Since 
then, many researchers have started to investigate this 
open problem of routing optimization in QoS 
extensions to OSPF protocols. 
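
As a rough illustration of the default cost rule described above, the following Python sketch computes 10^8/BW and clamps the result to the 16-bit range of 1 to 65,535. The function name `default_ospf_cost`, the use of integer truncation, and the sample bandwidths are assumptions made for illustration; they are not taken from this chapter or from any particular router implementation.

```python
REFERENCE_BANDWIDTH_BPS = 10**8   # numerator of the 10^8/BW formula quoted in the text
MAX_OSPF_COST = 65_535            # largest value of the unsigned 16-bit cost metric


def default_ospf_cost(bandwidth_bps: int) -> int:
    """Return the default interface cost 10^8/BW, clamped to the range 1..65535.

    Integer truncation is an assumption of this sketch.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be a positive number of bits per second")
    cost = REFERENCE_BANDWIDTH_BPS // bandwidth_bps
    return max(1, min(cost, MAX_OSPF_COST))


if __name__ == "__main__":
    # Hypothetical interface speeds, chosen only to show the clamping behaviour.
    for name, bw in [("64 kbps", 64_000), ("10 Mbps", 10_000_000),
                     ("100 Mbps", 100_000_000), ("1 Gbps", 1_000_000_000)]:
        print(name, default_ospf_cost(bw))
```

Under these assumptions, every link of 100 Mbps or faster collapses to cost 1, which is one way to see why a single bandwidth-derived cost cannot express delay or jitter requirements.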

The theory of rough sets recently has emerged 
as a major mathematical approach for managing 
uncertainty that arises from inexact, noisy, or in- 
complete information (Pawlak, 1991). The problem 


considered in this article is to decide the link rank 
by mining a series of link-state attributes based 
on rough set theory. A case study is conducted 
in this article in order to show that the reduction 
algorithm based on rough set theory can offer an 
attractive method to resolve the attribute selection 
problem in routing table of IP networks. 

QoS-BASED NETWORKS ROUTING 
PROBLEM 

In QoS-based routing, path selection for routing 
typically is formulated as a shortest-path opti- 
mization problem; that is, to select a series of 
network links connecting source and destination 
nodes such that particular objective attributes 
(e.g., cost, bandwidth, delay) are satisfactory 
(RFC 2386). Because the problem of calculating 
a path subject to multiple attribute constraints 
has been proved NP-complete for many common 
attributes combinations, usually a compromise 
is made by choosing (mining) a subset of QoS parameters.
This selection focuses on the mining 
of an appropriate path based on link attributes 
information and QoS requirements in the network. 
As a result, any algorithm that selects any two 
or more of hop count, time delay, delay jitter, and 
loss probability as important attributes will try 
to optimize these paths. 

The availability of network routing and network 
protocols to police network resources suggests a 
natural solution. The data mining mechanisms that 
directly use existing QoS mechanisms based on 
rough set theory could have been developed for 
constant bit-rate and low-bandwidth unreliable 
networks. Interactive data mining applications 
often exhibit bursty traffic patterns and operate 
on a large routing table. For data mining applications
operating on terabyte-sized data, path 
optimization is extremely important and largely 
expected (RFC 2676). Data mining applications 
driven by humans often have varying needs in 
terms of link quality and router performance. 
However, applications often can adapt to resource 
constraints by trading off resulting quality for 
router performance, if network resources are con- 
strained. In other words, the resulting quality can 
be sacrificed in a controlled manner, when link 
resources are available with router and response 
time is critical. 

For data mining and rough set applications, 
user requirements are specified in terms of per- 
formance and QoS. One of the main tools of data 
mining based on rough set is rule induction from 
raw data represented by a database. Real-life 
data are frequently imperfect (i.e., erroneous, 
incomplete, and uncertain). In a routing process, 
the problem is to determine which next node to 
select for next hop and which attributes to use for 
decision at a router. Considering computations 
at a router, typically while a single attribute is 
used as the decision variable, one would consider 
extensions to more than one base link attribute 
(e.g., Bandwidth and Delay, ...). Limiting 
oneself to selecting base attributes, the problem 
can be formulated as determining $f(x_1, \ldots, x_n)$ 
for all $x_i$ ($i = 1, 2, \ldots, n$), where $\{x_i\}$ are the attributes 
and $f(x_1, \ldots, x_n)$ measures the goodness of $x_i$ as 
decision attributes. Data mining is becoming 
an interdisciplinary field, drawing work from 
areas including database technology, knowledge 
acquisition, and mathematics such as the rough 
set theory. The rough set theory is a relatively 
new mathematical approach to problems with 
data imprecision, vagueness, and uncertainty. 
The concept of reduction decision table is very 
useful for feature selection. Because decision 
table includes condition attributes or features and 
decision attributes of categories, the procedure of 
feature selection based on decision table is distinct 
and effective (Lu & Zhang, 2005). 

The importance of attribute selection is its 
potential for speeding up the processes of routing 
and improving accuracy of classification. Rough 
sets can be used for attribute reduction, where 
attributes that do not contribute to the classifica- 
tion of the given training data can be identified 


and, consequently, removed. Rough sets also can 
contribute to the relevance analysis, where the 
contribution or significance of each attribute is 
assessed with respect to the given classification 
task. The problem of finding minimal subsets of a 
given set of attributes that describe concepts in a 
given data set is proven to be NP-hard (Chickering, 
1996). However, algorithms to reduce computa- 
tional intensity have been proposed. For example, 
a method of using a discernability matrix in rough 
set is proposed to store the differences between 
attribute values for each pair of data samples 
(Kusiak, 2001). Rather than searching on the 
entire data set, the matrix is searched instead to 
detect redundant attributes. The application of the 
rough set theory can solve this problem success- 
fully. Usually, the link information of networks is 
classified by many QoS parameters such as link 
propagation delay, link available bandwidth, link 
jitter, possibility of connection and hop counts, 
and so forth. Then, protocol QOSPF can select 
the best path with the link rank. We perceive that 
data mining techniques based on rough set can be 
applied in order to obtain a reduced representation 
of routing attribute of data set that is much smaller 
in volume yet closely maintains the integrity of 
the original route table. 

Rough set theory is one of the newer data mining 
theories (Kusiak, 2001). In routing, it is used for 
attribute reduction of the routing table, for finding 
hidden route-decision patterns, and for generating 
new route-decision rules. 


RELATED WORK 

QoS-Based Routing 

QoS-based routing is defined as “a routing mecha- 
nism under which paths for flows are determined 
based on some knowledge of resource availability 
in the network as well as the QoS requirement 
of the flows” (RFC 2386) or “a dynamic routing 
protocol that has expanded its path-selection criteria
to include QoS parameters such as available 
bandwidth, link and end-to-end path utilization, 
node resources consumption, delay and latency, 
and induced jitter.” In short, it is a dynamic rout- 
ing scheme with QoS considerations. Routes that 
can satisfy QoS requirements of a new flow rely 
on both the knowledge of the flow’s requirements 
and the information about availability of resources 
in the network. In addition, for the purpose of ef- 
ficiency, it is important for an algorithm to account 
for the amount of resources that the network has 
to allocate to support any new flow. QoS-based 
routing is supposed to find a path from source to 
destination that can satisfy a user’s requirements 
on bandwidth, end-to-end delay, and so forth. In 
addition, this has to be performed dynamically 
in coping with changes. In case there are several 
feasible paths available, the path selection can be 
based on some policy constraints. For example, 
we can choose a path that costs less in terms of 
money or the one via the designated service router. 
Path selections that are based on demand forecast 
are not accurate. Users may connect to another 
telecommunications operator, or they may con- 
nect to the network earlier or later than forecast. 
This article presents a method for optimizing a 
router in the presence of uncertainty, based on 
rough set theory. 

In general, a network prefers to select the 
cheapest or best path among all paths suitable 
for a new flow, and it even may decide not to ac- 
cept a new flow for which a feasible path exists, 
if the cost is deemed too high. Accounting for 
these aspects involves several metrics on which 
the route is based (RFC 2386), which include the 
following: 

• Possibility of Connection: Usually, in 
traditional networks, possibility of link 
connection is high, and this metric can be 
ignored. But in wireless network, it is very 
important. 


• Link-Available Bandwidth: As mentioned 
earlier, we currently assume that most QoS 
requirements are derivable from bandwidth. 
We further assume that associated with 
each link is a maximal bandwidth value 
(e.g., the link physical bandwidth or some 
fraction thereof that has been set aside for 
QoS flows). Since in order for a link to be 
capable of accepting a new flow with given 
bandwidth requirements, at least that much 
bandwidth still must be available on the link, 
then the relevant link metric is, therefore, the 
(current) amount of available bandwidth. 

• Link Propagation Delay: This quantity is 
meant to identify high latency links (e.g., 
satellite links) that may be unsuitable for 
real-time requests. This quantity also needs 
to be advertised as part of extended LSAs 
(link-state advertisements), although 
timely dissemination of this information is 
not critical, as this parameter is unlikely to 
change (significantly) over time. 

• Link Jitter: This quantity is used to measure 
change of link delay. A path with a smaller 
jitter is preferable. 

• Hop Count: This quantity is used to mea- 
sure path cost in the network. A path with a 
smaller number of hops is preferable, since 
it consumes less network resources. As a 
result, the path selection algorithm will at- 
tempt to find the minimum number of hops 
in a path. 

The routing will focus on the selection of an 
appropriate path, based on link-attribute metrics 
information and flow requirements on the Internet. 
Let m(i, j) be a metric of link (i, j). For a path p = (i, j, k, ..., 
m, n), a metric m(p) (Jonathan & Guo, 2002) is: 

• Additive if 

$m(p) = m(i, j) + m(j, k) + \cdots + m(m, n)$; 

• Multiplicative if 

$m(p) = m(i, j)\, m(j, k) \cdots m(m, n)$; 

• Concave if 

$m(p) = \min\{m(i, j), m(j, k), \ldots, m(m, n)\}$. 

For example, metrics such as propagation delay, 
link jitter, and hop count are additive, the possibil- 
ity of connection is multiplicative, and bandwidth 
is concave. In QoS-based routing, paths of packet 
flows would be determined based on the previous 
QoS requirements. The main objective of QoS- 
based routing is to realize dynamic determination 
of feasible paths; QoS-based routing can deter- 
mine a path from among possibly many choices 
that have a good chance of accommodating QoS 
of the given flow. Feasible path selection may be 
subject to policy constraints, such as path cost, 
provider selection, and so forth (Jonathan & Guo, 
2002). It successfully optimizes resource usage. 
A network state-dependent, QoS-based routing 
scheme can aid in an efficient utilization of net- 
work resources by improving the total network 
throughput. Such a routing scheme is a basis of 
efficient network engineering. 
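
The three composition rules for path metrics can be sketched in a few lines of Python. The helper names (`additive`, `multiplicative`, `concave`), the node labels, and the link values below are hypothetical and serve only to show how delay, connection possibility, and bandwidth combine along one path.

```python
import math
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def path_links(path: List[str]) -> List[Link]:
    """Split a node sequence such as [A, B, C] into its links [(A, B), (B, C)]."""
    return list(zip(path, path[1:]))


def additive(path: List[str], metric: Dict[Link, float]) -> float:
    """Additive metrics (delay, jitter, hop count) sum over the links."""
    return sum(metric[link] for link in path_links(path))


def multiplicative(path: List[str], metric: Dict[Link, float]) -> float:
    """Multiplicative metrics (possibility of connection) multiply over the links."""
    return math.prod(metric[link] for link in path_links(path))


def concave(path: List[str], metric: Dict[Link, float]) -> float:
    """Concave metrics (bandwidth) are limited by the weakest link."""
    return min(metric[link] for link in path_links(path))


if __name__ == "__main__":
    path = ["A", "B", "C", "D"]                      # hypothetical path
    delay = {("A", "B"): 2, ("B", "C"): 3, ("C", "D"): 5}
    connect = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.5}
    bandwidth = {("A", "B"): 10, ("B", "C"): 4, ("C", "D"): 8}
    print(additive(path, delay))            # total delay along the path
    print(multiplicative(path, connect))    # probability that every link is up
    print(concave(path, bandwidth))         # bottleneck bandwidth
```

With these sample values the path has a total delay of 10, a connection possibility of about 0.36, and a bottleneck bandwidth of 4.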

QOSPF Routing Algorithm 

In traditional data communication networks, rout- 
ing is concerned primarily with the connectivity. 
Routing protocols usually characterize networks 
with a single routing attribute and use shortest- 
path algorithms for path computations, which are 
typically transparent to any QoS requirements. 
As a result, routing decisions are made without 
any awareness of link-resource availability and 
relevant user requirements. 

OSPF is defined in RFC 2328. It is a link-state 
routing protocol that uses Dijkstra’s algorithm to compute 
shortest paths to destinations. In OSPF, each router sends 
link-state advertisements about itself and its links to 
all its adjacent routers. Each router that receives a 
link-state advertisement records the information 
in its topology database and sends a copy of the 
link-state advertisement to each of its adjacent 
routers. All the link-state advertisements can reach 
all other routers in the same area, which enables 
each router in the area to have an identical topology 
database. A router does not send out routing tables 
but link-state information about its interfaces. 
When the topology databases are completed, each 
router individually will calculate a loop-free 
shortest-path tree. Destinations outside the area 
also are advertised in link-state advertisements. 
These, however, do not require that routers run 
the SPF (Shortest Path First) algorithm 
before they are added to the routing table. Changes 
of all metrics need to be advertised as part of 
extended LSAs so that accurate information is 
available to the path selection algorithm. QOSPF 
is the QoS extension to OSPF; its support for QoS 
routing can be viewed as consisting of 
three major components (RFC 2386): 

1. Obtain the information needed to compute 
QoS paths and select a path capable of meet- 
ing the QoS requirements of a given user 
request. 

2. Establish the path selected to accommodate 
a new request. 

3. Maintain the path assigned for use by a given 
request. 

QOSPF uses a link-state algorithm in order 
to build and calculate the shortest path to all 
known destinations. The algorithm by itself is 
quite complicated. The following is a very 
high-level, simplified description of 
the algorithm: 

1. Upon initialization or due to any change in 
routing information, a router will gener- 
ate a link-state advertisement, which will 
represent the collection of all link-states on 
that router. 

2. All routers will exchange link states by 
means of flooding. Each router that receives 
a link-state update should store a copy in its 
link-state database and then propagate the 
update to other routers. 
3. After the database of each router is completed,
the router will calculate a Shortest 
Path Tree to all destinations. The router 
uses the Dijkstra algorithm to calculate the 
shortest path tree. The destinations, the as- 
sociated cost, and the next hop to reach those 
destinations will form the IP routing table. 

4. In case no changes in the QOSPF network 
occur, such as cost of a link or a network being 
added or deleted, OSPF should be very quiet. 
Any changes that occur are communicated via 
link-state packets, and the Dijkstra algorithm 
is recalculated to find the shortest path. 

In order to reduce communications overhead, 
routing algorithms based on link status information 
such as SPF send broadcast messages that contain 
only information of link status. SPF will broadcast 
messages that contain only the node’s link status 
instead of the entire routing table. It seems easy to 
collect information on communication latency of 
links and to calculate routes with minimal delay; 
however, this is almost impossible in large net- 
works, because we need to collect information of 
communication latency of all links frequently by 
message broadcasting, which leads to extremely 
heavy communication overheads. In addition, 
delayed information for the latency may create 
far-from-optimal routes. It is not uncommon for 
an application to transmit data with variable QoS 
requirements. For example, a video application may 
require different levels of service, depending upon 
the content of the connection. Consider a video 
sequence that consists of a highly dynamic set of ac- 
tion scenes followed by a relatively static sequence. 
The first part, due to rapid camera movement, is 
reasonably tolerant with data loss and corruption 
but intolerant with high-link jitter. In contrast, the 
static scenes are tolerant with link jitter but require 
minimal data loss and corruption. In a very large 
network such as the Internet, it is essential that 
routing algorithms be scalable. In order to achieve 
the scalability for adaptive network routing algo- 
rithms, it is expected to select (mine) important 


QoS attributes, depending on rough set with as 
little communication overheads as possible. 

ROUGH SET METHODOLOGY 

Relative Reduction of Rough Set 

Rough set theory is a mathematical approach 
to information analysis introduced by Pawlak 
(1999). 

Rough set based on feature selection is an 
extension of conventional set theory that supports 
approximations in making decisions. The rough 
set itself is an approximation of a vague data set by 
a pair of precise concepts called lower and upper 
approximations, which are a classification of the 
domain of interest into disjoint categories. The 
lower approximation is a description of domain 
objects that are known with a certainty belong- 
ing to a subset of interests, whereas the upper 
approximation is a description of objects that pos- 
sibly would belong to the subset (Parthasarathy, 
2001). The main approach to finding rough-set 
reduction is concerned with the discernability 
matrix. This section describes the fundamental 
ideas behind this approach. 

An information system S is a quadruple $(U, A, V, f)$, 
where $U = \{x_1, x_2, \ldots, x_n\}$ denotes the set 
of all objects, and $A$ is the set of all attributes that are 
classified further into two disjoint subsets: the 
condition attributes $C = \{a_i \mid i = 1, \ldots, m\}$ and the 
decision attribute $D = \{d\}$, such that $A = C \cup D$ and 
$C \cap D = \emptyset$. $V = \bigcup_{a \in A} V_a$ is a set of attribute values, 
where $V_a$ is the domain of attribute $a$. Notation $a_j(x_i)$ 
denotes the value of $x_i$ on attribute $a_j$. $f : U \times A \to V$ 
is an information function, which appoints the 
attribute value of every object x in U. 

A network can be modeled as a graph G = (V, 
E ). Nodes (V) of the graph represent switches, rout- 
ers, and hosts (here represent routers). Edges (E) 
represent communication links. A symmetric link 
has the same attribute values (bandwidth, propagation 
delay, etc.) in both directions. In order to illustrate the operation 
of these, an example data set of routing attributes 
(Table 1) is used, where $C = \{a_1, a_2, a_3, a_4\}$ is a 
routing attribute set in which $a_1$ represents the 
bandwidth of link: number 1 denotes enough, 2 
denotes available, and 3 denotes not enough. 
Notation $a_2$ represents the propagation delay 
of link: 1 denotes low, 2 denotes normal, and 3 
denotes high. Notation $a_3$ represents the failure 
probability of link: 1 denotes low, 2 denotes normal, 
and 3 denotes under normal. Notation $a_4$ 
represents the bit-error ratio of link: 1 denotes low, 
2 denotes acceptable, and 3 denotes insufferable. 
Decision attribute $D = \{d\} = \{1, 2, 3\}$: number 
1 for good, 2 for normal, and 3 for bad (their 
weights are 1, 2, and 3, respectively). 

Consider the data set in Table 1 with five 
objects, four features $a_1$ to $a_4$, and decision $d$; the features 
denote process (route) parameters (e.g., 
bandwidth, delay), and the decision is the routing 
performance (good, normal, bad). Some decision 
rules can be extracted from the data set of Table 1: 

• Rule 1: if ($a_2 = 1$) then ($d = 1$) 

• Rule 2: if ($a_1 < 2$) and ($a_3 = 1$) and ($a_4 < 2$) 
then ($d = 1$) 

• Rule 3: if ($a_1 = 3$) or ($a_4 = 3$) or ($a_2 = 3$) then 
($d = 3$) 

• Rule 4: if ($a_1 < 2$) and ($a_2 = 2$) and ($a_4 = 2$) 
then ($d = 2$) 


Table 1. An example data set 
Rough set attributes reduction method can 
remove redundant conditional attributes (or routing 
attributes) from a normal data set (Jensen & 
Shen, 2004). With any $P \subseteq A$, there is an associated 
equivalence relation $IND(P)$ or $U/P$: 

$$IND(P) = \{(x, y) \in U^2 \mid \forall a \in P,\ a(x) = a(y)\} \qquad (1)$$


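Let $X \subseteq U$; the lower and upper approximations of 
set $X$ are defined as $\underline{P}X = \{x \mid x \in U,\ [x]_P \subseteq X\}$ and 
$\overline{P}X = \{x \mid x \in U,\ [x]_P \cap X \neq \emptyset\}$, respectively. Let P 
and Q be equivalence relations over U; then the positive 
region can be defined as: 

$$POS_P(Q) = \bigcup_{X \in U/Q} \underline{P}X \qquad (2)$$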

A positive region contains all objects in U 
that can be classified in attributes Q using the 
information in attribute P. Using this definition of 
the positive region, a set of attributes Q depends 
on a set of attributes P. The rough set degree of 
dependency of a set attributes Q on P is defined 
in the following: 


$$\gamma_P(Q) = \frac{|POS_P(Q)|}{|U|} \qquad (3)$$
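
To make the definitions of the indiscernibility partition, the approximations, the positive region, and the degree of dependency concrete, here is a minimal Python sketch. The function names and the five-object decision table are hypothetical; the table is not Table 1 of this chapter.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set

Row = Dict[str, int]


def partition(rows: List[Row], attrs: Sequence[str]) -> List[Set[int]]:
    """U/IND(attrs): group object indices that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())


def lower_approximation(rows, attrs, target: Set[int]) -> Set[int]:
    """Objects whose equivalence class lies entirely inside target."""
    return {i for block in partition(rows, attrs) if block <= target for i in block}


def upper_approximation(rows, attrs, target: Set[int]) -> Set[int]:
    """Objects whose equivalence class intersects target."""
    return {i for block in partition(rows, attrs) if block & target for i in block}


def positive_region(rows, p_attrs, q_attrs) -> Set[int]:
    """Union of the P-lower approximations of all classes of U/Q."""
    pos: Set[int] = set()
    for x in partition(rows, q_attrs):
        pos |= lower_approximation(rows, p_attrs, x)
    return pos


def dependency(rows, p_attrs, q_attrs) -> float:
    """Degree of dependency gamma_P(Q) = |POS_P(Q)| / |U|."""
    return len(positive_region(rows, p_attrs, q_attrs)) / len(rows)


# Hypothetical decision table: objects 0 and 1 agree on a1, a2 but differ on d.
TABLE = [
    {"a1": 1, "a2": 1, "d": 1},
    {"a1": 1, "a2": 1, "d": 2},
    {"a1": 2, "a2": 1, "d": 2},
    {"a1": 3, "a2": 2, "d": 3},
    {"a1": 3, "a2": 3, "d": 3},
]

if __name__ == "__main__":
    print(dependency(TABLE, ["a1", "a2"], ["d"]))
    print(dependency(TABLE, ["a1"], ["d"]))
    print(dependency(TABLE, ["a2"], ["d"]))
```

In this toy table the degree of dependency of d on {a1} equals that on {a1, a2} (3 of 5 objects are in the positive region), so a2 could be dropped without losing classification quality, whereas {a2} alone classifies only 2 of 5 objects with certainty.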


Reduction of attributes is achieved by comparing 
equivalence relations generated by sets of 
attributes. Attributes are removed so that the reduced 
set provides decision features that have the same 
quality as the original. A reduction is defined as 
a subset $X$ of the conditional attribute 
set $Y$ such that $\gamma_X(D) = \gamma_Y(D)$, where $D$ 
is a set of decision attributes. A given data set 
may have many attribute-reduction sets, so the set 
$R$ of all reduction sets is: 

$$R = \{X \mid X \subseteq Y,\ \gamma_X(D) = \gamma_Y(D)\} \qquad (4)$$

The minimal reduction set $R_{\min} \subseteq R$ is defined 
as the set of those reduction sets in $R$ with minimal 
cardinality: 

$$R_{\min} = \{X \mid X \in R,\ \forall Z \in R,\ |X| \le |Z|\} \qquad (5)$$

These minimal subsets can make decision 
classes with the same discriminating power as 
the whole condition attributes. 

A discernability matrix is a nxn matrix in 
which the classes are diagonal. In the matrix, the 
(condition) attributes that can be used to discern 
between the classes in the corresponding row 
and column are inserted (Li, 2003). Data items 
of discernability matrix contain attributes used 
to discern objects. A single element in data items 
must be a member of reduction; hence, single data 
items can be included individually through the 
mining process based on rough set theory. At the 
same time, data items that include the elements 
are removed from the discernability matrix. Since 
the reduction is to find the minimal attribute set 
discerning all objects, the data items of the discern- 
ability matrix contain all discerning information 
of objects. The intersection of a reduction and 
data items of the discernability matrix cannot be 
empty. If there is an empty intersection between 
some data items with reduction, then correspond- 
ing objects would not be discerned by reduction 
(Li, Tang, Ni, & Yang, 2002). 

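The information system’s discernability matrix is 
$M = [C_D(i, j)]_{n \times n}$, where $C_D(i, j)$ is defined as: 

$$C_D(i, j) = \begin{cases} \{a \in C \mid a(x_i) \neq a(x_j)\}, & d(x_i) \neq d(x_j) \\ 0, & d(x_i) = d(x_j) \end{cases} \qquad (6)$$

where $i, j = 1, \ldots, n$. 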

In the definition of the discernability matrix, 
when $|C_D(i, j)| = 1$, the attribute in $C_D(i, j)$ is a 
core attribute. All attributes in entries $C_D(i, j)$ 
with $|C_D(i, j)| = 1$ together constitute the core attribute 
set, which may be empty. We set $C_D(i, j) = 0$ when $C_D(i, j)$ 
contains a core attribute. Then, a new simple 
matrix can be obtained. 


$$\begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1n} \\ \vdots & \vdots & & \vdots \\ C_{n1} & C_{n2} & \cdots & C_{nn} \end{pmatrix}$$

We can get: 

$$L_{ij} = \bigvee_{a_i \in C_{ij}} a_i \qquad (7)$$

$$L = \bigwedge_{C_{ij} \neq 0} L_{ij} \qquad (8)$$

$$L' = \bigvee_{i} L_i \qquad (9)$$

The reduced attribute set is $L' \cup Core(A)$. 
Here, $Core(A)$ denotes the core attribute set. For 
finding reducts, the decision-relative discernability 
matrix is of more interest. This only considers 
those link discernabilities that occur when the 
corresponding routing attributes differ. 

From Table 1, the discernability matrix 
$M[C_D(i, j)]_{n \times n}$ can be given as follows (see Box 
1). 

In this matrix, $|C_D(4, 6)| = 1$ and $|C_D(6, 7)| = 1$, so we 
can get $Core(A) = \{a_1, a_4\}$. The decision-relative 
discernability matrix found in Table 1 is produced. 
For example, it can be seen from the table that 
objects 3 and 8 differ in each attribute. Although 
some attributes in objects 1 and 2 differ, their 
corresponding decisions are the same, so no entry 
appears in the decision-relative matrix. Grouping 
all entries containing single attributes forms the 
core of the data sets. Here, the core of the set is 
$\{a_1, a_4\}$. This means that $a_1$ and $a_4$ are the most 
important routing attributes. When $C_D(i, j)$ contains 
$a_1$ or $a_4$, then we set $C_D(i, j) = 0$. 
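
The construction of a decision-relative discernability matrix and the extraction of the core from its single-attribute entries can be sketched as follows. The function names and the four-object table are hypothetical and do not reproduce Table 1 or Box 1; empty entries play the role of the value 0 used in the text.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Row = Dict[str, int]
Matrix = Dict[Tuple[int, int], FrozenSet[str]]


def discernibility_matrix(rows: List[Row], cond: Sequence[str], dec: str) -> Matrix:
    """Upper triangle of the decision-relative discernability matrix.

    Entry (i, j) holds the condition attributes on which objects i and j differ,
    but only when their decisions differ; otherwise it is empty.
    """
    matrix: Matrix = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[i][dec] != rows[j][dec]:
                matrix[(i, j)] = frozenset(a for a in cond if rows[i][a] != rows[j][a])
            else:
                matrix[(i, j)] = frozenset()
    return matrix


def core(matrix: Matrix) -> Set[str]:
    """Attributes that appear alone in some entry (entries of cardinality 1)."""
    return {a for entry in matrix.values() if len(entry) == 1 for a in entry}


def remove_core_entries(matrix: Matrix, core_attrs: Set[str]) -> Matrix:
    """Empty every entry that contains a core attribute, as described in the text."""
    return {k: (frozenset() if v & core_attrs else v) for k, v in matrix.items()}


# Hypothetical link records with three condition attributes and a decision d.
TABLE = [
    {"a1": 1, "a2": 1, "a3": 1, "d": 1},
    {"a1": 2, "a2": 1, "a3": 1, "d": 2},
    {"a1": 2, "a2": 2, "a3": 2, "d": 3},
    {"a1": 1, "a2": 1, "a3": 2, "d": 1},
]

if __name__ == "__main__":
    m = discernibility_matrix(TABLE, ["a1", "a2", "a3"], "d")
    c = core(m)
    print(sorted(c))
    print({k: sorted(v) for k, v in remove_core_entries(m, c).items() if v})
```

For this toy table the only single-attribute entry is {a1}, so the core is {a1}; after emptying every entry that contains a1, only the entry {a2, a3} for objects 1 and 2 remains to be covered.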

Box 1. 

The reduction of the data sets can be derived 
by converting the previous expression from 
conjunctive normal form to disjunctive normal 
form. Although this is guaranteed to discover 
all minimal subsets, it is a costly operation that 
renders the method impractical for even medium-sized 
data sets. For most applications, a single 
minimal subset is required for data reduction. 
This has led to approaches that consider finding 
individual shortest prime implicants from the 
discernability function. A common method is to 
incrementally add those attributes that occur with 
the highest frequency in the function, removing 
any clauses containing the attributes, until all 
clauses are eliminated. However, even this does 
not ensure that a minimal subset is mined; the 
search could proceed down non-minimal paths 
(Yang & Chiam, 2000). 
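
The frequency-based heuristic described above can be sketched in Python as a greedy covering loop over the clauses of the discernability function. The function name `greedy_reduct`, the alphabetical tie-breaking rule, and both clause lists are assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Set


def greedy_reduct(clauses: Iterable[Set[str]]) -> List[str]:
    """Pick attributes by descending frequency until every clause is covered.

    Each clause is one non-empty entry of the discernability matrix. Ties are
    broken alphabetically (an assumption of this sketch) to keep runs repeatable.
    """
    remaining = [set(c) for c in clauses if c]
    chosen: List[str] = []
    while remaining:
        counts = Counter(a for clause in remaining for a in clause)
        best = min(counts, key=lambda a: (-counts[a], a))
        chosen.append(best)
        # Remove every clause that the chosen attribute already discerns.
        remaining = [c for c in remaining if best not in c]
    return chosen


if __name__ == "__main__":
    # Hypothetical clauses for which the heuristic happens to do well.
    easy = [{"a1"}, {"a1", "a2", "a3"}, {"a2", "a3"}, {"a1", "a3"}, {"a1", "a2"}]
    print(greedy_reduct(easy))

    # Hypothetical clauses for which it overshoots: {b, c, d} covers everything,
    # but the most frequent attribute "a" is picked first and is then redundant.
    hard = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"b", "e"}, {"c", "f"}, {"d", "g"}]
    print(greedy_reduct(hard))
```

The second clause list shows the caveat from the text: the heuristic returns four attributes although three suffice, so the result is a covering set but not necessarily a minimal one.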

Algorithm for Selection of QoS 
Routing Attributes 

The following algorithm is proposed to classify 
link-rank to realize QOSPF. 

The first stage is for data preprocessing. 

• Step 1: From the historical routing record of 
the subnet and the QoS information about 
the links, the information system about the 
links can be built. 

• Step 2: Then we can draw the discernability 
matrix about the information system and 
can conclude the routing-reduction table 
(Apostolopoulos et al., 1999). 


1. $RT = \emptyset$, $Core = \emptyset$; 

2. For every $b \in A$, compute the equivalent class 
$U/P$; 

3. Construct discernability matrix 
$M(A) = (C_D(i, j))_{n \times n}$, where $1 \le i, j \le n$. 

4. Do loop: $\forall C_D(i, j) \in M(A)$, if ($|C_D(i, j)| = 1$) 
then $Core = Core \cup \{C_D(i, j)\}$; 

5. $RT = Core$; 

6. $\forall a \in Core$, we set $C_D(i, j) = 0$ when 
$a \in C_D(i, j)$. 

The second stage is for the mining process. 

• Step 3: Do reduction of attribute based on 
the information system. 

• Step 4: The logical rules can be concluded 
from routing-reduction table. Created rules 
of rough decisions are saved in rule set. 

• Step 5: With the knowledge of QOSPF, the 
link with QoS attributes can be mined. 

The third stage is for obtaining the best path with 
the knowledge of QOSPF. 

The link-state algorithm presented as follows 
is known as Dijkstra’s algorithm based on 
on-demand computation of QoS paths, which 
is described to illustrate how the algorithm can 
select a minimum-delay path with a maximum 
bandwidth. Some researchers proposed the delay 
and bandwidth attributes to be mined from these 
steps (Jonathan & Guo, 2002). (See Exhibit A.) 

Exhibit A. 

Initialization: 

    for (each destination n in the set of nodes in the network) do 
    begin 
        DT[n] = infinity; 
        BW[n] = undefined; 
        NB[n] = undefined; 
    end 
    DT[s] = 0;          /* the source node s */ 
    BW[s] = infinity; 

Compute QoS routing paths: 

    S = the set that contains all nodes in the network; 
    while (S is not empty) do 
    begin 
        u = the node in S whose value in the field DT is minimum; 
        S = S - {u}; 
        for (each node v adjacent to u) do 
        begin 
            if ((b(u, v) >= bandwidth requirement) and 
                (DT[v] > DT[u] + d(u, v))) then 
            begin 
                DT[v] = DT[u] + d(u, v); 
                BW[v] = min{BW[u], b(u, v)}; 
                if (the node u is the source node s) then 
                    NB[v] = v; 
                else 
                    NB[v] = NB[u]; 
            end 
        end 
    end 

In the previous steps, b(i,j ) denotes the avail- 
able bandwidth on the edge between nodes i and 
j. Notation d(i, j) denotes the propagation delay 
on the edge between nodes i and j. Notation BW 
is the maximum available bandwidth on a path 
between the source s and destination n. Notation 
DT is the minimal delay on a path between source 
s and destination n. Notation NB is the associated 
routing information. 
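
The steps of Exhibit A can also be written as a short runnable Python sketch. It uses a binary heap instead of a linear scan for the minimum-DT node, which is an implementation choice of this sketch rather than something stated in the chapter; the function name `qos_dijkstra`, the graph encoding, and the four-node example network are likewise hypothetical.

```python
import heapq
import math
from typing import Dict, Optional, Tuple

# graph[u][v] = (available bandwidth b(u, v), propagation delay d(u, v))
Graph = Dict[str, Dict[str, Tuple[float, float]]]


def qos_dijkstra(graph: Graph, source: str, bandwidth_requirement: float):
    """Minimum-delay paths that use only links meeting the bandwidth requirement.

    Returns DT (delay), BW (bottleneck bandwidth) and NB (first hop from the
    source) per node. Every node must appear as a key of graph.
    """
    DT: Dict[str, float] = {n: math.inf for n in graph}
    BW: Dict[str, Optional[float]] = {n: None for n in graph}
    NB: Dict[str, Optional[str]] = {n: None for n in graph}
    DT[source] = 0
    BW[source] = math.inf

    heap = [(0.0, source)]
    done = set()
    while heap:
        _, u = heapq.heappop(heap)
        if u in done:
            continue                      # stale heap entry
        done.add(u)
        for v, (b, d) in graph[u].items():
            if b >= bandwidth_requirement and DT[v] > DT[u] + d:
                DT[v] = DT[u] + d
                BW[v] = min(BW[u], b)
                NB[v] = v if u == source else NB[u]
                heapq.heappush(heap, (DT[v], v))
    return DT, BW, NB


# Hypothetical directed network: the A-C-D route is faster but narrower.
NETWORK: Graph = {
    "A": {"B": (10, 1), "C": (3, 1)},
    "B": {"D": (8, 2)},
    "C": {"D": (3, 1)},
    "D": {},
}

if __name__ == "__main__":
    for requirement in (1, 5):
        DT, BW, NB = qos_dijkstra(NETWORK, "A", requirement)
        print(requirement, DT["D"], BW["D"], NB["D"])
```

With a bandwidth requirement of 1 the sketch routes to D via C (delay 2, bottleneck 3); raising the requirement to 5 prunes the narrow links and the route goes via B (delay 3, bottleneck 8).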

With the help of data mining based on rough-set 
theory, simplifying the routing table means 
reducing the condition attributes in the routing 
table; after that, the reduced routing-attribute table 
retains the classification ability of the whole routing-attribute 
table before simplification while keeping only the more 
important condition attributes (Zhang, 2003). 

The approaches previously described are based 
on Dijkstra’s shortest-path algorithms. The Dijk- 
stra algorithm traditionally has been considered 
more efficient than standard shortest-path compu- 
tations because of its lower worst-case complexity. 
The benefit of using Dijkstra’s algorithm in QoS 
path selection has a greater synergy with the 
existing OSPF implementation. On-demand path 
computation of Dijkstra-based routing-attribute 
mining provides advantages in yielding better 
routes and minimizing the need for storage of data 
structures, if there are reduced routing attributes for 
QoS paths (Jonathan & Guo, 2002). The asymptotic 
worst-case complexity of an implementation of 
Dijkstra’s algorithm is $O(E \log N)$, where N is the 
number of nodes in the network graph, and E is 
the number of the edges. The complexity of this 
rough-set-processing computation is $O(M \cdot N + M^2)$, 
where M is the number of attributes. 


CASE STUDY 

Figure 1 shows a weighted graph model as an 
example of a subnet’s links to validate the previous 
algorithm. The nodes denote the routers. The numbers 
on the links denote the link numbers. Routing is 
determined by the links’ QoS attribute parameters, 
such as available bandwidth, propagation delay, 
link jitter, bit-error ratio, and connection possibility. 
We presume that the standard of classification 
is the link rank, which can be described by I, II, ..., 
VI based on historical routing data. The routing 
algorithm is used to select the best path from A to 
D with the given QoS attributes. 

Figure 1. Subnet Topology 

Building the Information Base 

Table 2 shows the information base derived from Figure 1. All of the values in the table denote the measurements of the attributes (attribute weights abstracted from the real world).

$S = (U, C \cup \{d\}, V, f)$ is an information system, where $U$ is a finite non-null set of link objects, and $C$ is a finite non-null set of the links' QoS attributes. A link object has an IP address and participates in the QoS routing graph. Here,

$U = \{1, 2, 3, \ldots, 10\}$;

$C = \{c_1, c_2, c_3, c_4, c_5\}$;

$D = \{\mathrm{I}, \mathrm{II}, \mathrm{III}, \mathrm{IV}, \mathrm{V}, \mathrm{VI}\}$;

$V_1 = \{1, 2, 3, 4, 5, 6, 7, 8\}$;

$V_2 = \{1, 2, 3, 4, 5\}$;

$V_3 = \{0, 1, 2, 3, 4, 5\}$;

$V_4 = \{1, 2, 3, 4, 5, 6, 8\}$;

$V_5 = \{1, 2, 3\}$.

Table 2 expresses the information function $f$.

Reduction Information Base 

According to Formula (6), the discernibility matrix $M(C_D(i, j))$ of Table 2 can be given as follows (see Box 2).


Table 2. Information base of links

| U \ C | Available bandwidth ($c_1$) | Propagation delay ($c_2$) | Link jitter ($c_3$) | Bit error ratio ($c_4$) | Connection possibility ($c_5$) | Link rank ($d$) |
|---|---|---|---|---|---|---|
| 1 | 4 | 4 | 1 | 5 | 3 | IV |
| 2 | 5 | 3 | 1 | 8 | 2 | III |
| 3 | 1 | 4 | 3 | 5 | 3 | IV |
| 4 | 1 | 4 | 3 | 5 | 1 | VI |
| 5 | 6 | 2 | 2 | 3 | 2 | II |
| 6 | 3 | 4 | 3 | 5 | 1 | IV |
| 7 | 1 | 5 | 1 | 3 | 1 | V |
| 8 | 3 | 1 | 0 | 3 | 3 | I |
| 9 | 7 | 5 | 4 | 1 | 3 | III |
| 10 | 8 | 1 | 1 | 2 | 3 | I |

The discernibility matrix is used to find a subset of attributes that yields the same partition of the data as the whole set of attributes. To do this, one has to construct the so-called discernibility function. This is a Boolean function constructed in the following way (Lu & Zhang, 2005).

For each element $C_D(i, j)$ that is not empty ($C_D(i, j) \neq 0$), the disjunctive normal form (DNF) logic expression of Formula (7) can be derived. For example:

$L_{1,2} = c_1 \lor c_2 \lor c_4 \lor c_5$;

$L_{1,4} = c_1 \lor c_3 \lor c_5$;

$L_{2,5} = c_1 \lor c_2 \lor c_3 \lor c_4$;

$L_{9,10} = c_1 \lor c_2 \lor c_3 \lor c_4$.
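The example expressions above can be checked mechanically. The sketch below builds the discernibility entries from the rows of Table 2; the names `TABLE2`, `ATTRS`, and `discernibility` are illustrative assumptions, and an entry is created only for pairs of links whose ranks differ.

```python
from itertools import combinations

# Rows of Table 2: link -> ((c1, c2, c3, c4, c5), rank d).
# c4 of link 4 is taken as 5, which agrees with L_{1,4} = c1 v c3 v c5
# (links 1 and 4 must not differ on c4).
TABLE2 = {
    1: ((4, 4, 1, 5, 3), "IV"),
    2: ((5, 3, 1, 8, 2), "III"),
    3: ((1, 4, 3, 5, 3), "IV"),
    4: ((1, 4, 3, 5, 1), "VI"),
    5: ((6, 2, 2, 3, 2), "II"),
    6: ((3, 4, 3, 5, 1), "IV"),
    7: ((1, 5, 1, 3, 1), "V"),
    8: ((3, 1, 0, 3, 3), "I"),
    9: ((7, 5, 4, 1, 3), "III"),
    10: ((8, 1, 1, 2, 3), "I"),
}
ATTRS = ("c1", "c2", "c3", "c4", "c5")


def discernibility(table):
    """Map each pair (i, j), i < j, with different ranks to the set of
    condition attributes on which the two links differ."""
    entries = {}
    for i, j in combinations(sorted(table), 2):
        (vi, di), (vj, dj) = table[i], table[j]
        if di != dj:
            entries[(i, j)] = frozenset(
                a for a, x, y in zip(ATTRS, vi, vj) if x != y
            )
    return entries


M = discernibility(TABLE2)
```

Here `M[(1, 2)]` holds the attributes of $L_{1,2}$, and pairs with the same rank, such as links 1 and 3, have no entry.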

Taking the conjunction of these disjunctive expressions gives the conjunctive normal form (CNF) logic expression of Formula (8). In this step, the CNF of each part of the matrix can be calculated:

$CNF_1 = c_1 \lor c_2 \lor c_3 \lor c_4 \lor c_5$;

$CNF_2 = c_1 \lor c_2 \lor c_3 \lor c_4$;

$CNF_3 = c_1 \lor c_2 \lor c_3 \lor c_4 \lor c_5$;

$CNF_4 = (c_1 \lor c_2 \lor c_4 \lor c_5) \land c_5 = (c_1 \land c_5) \lor (c_2 \land c_5)$;

$CNF_5 = c_1 \lor c_2 \lor c_3 \lor c_4$;

$CNF_6 = c_1 \lor c_5$;

$CNF_7 = c_2 \lor c_5$;

$CNF_8 = c_1 \lor c_3$;

$CNF_9 = c_1 \lor c_3 \lor c_4$;

$CNF_{10} = c_1 \lor c_4$.

According to Formula (9), the CNF of $L$ can be transformed into a new DNF, $L' = (c_1 \land c_2 \land c_5) \lor (c_1 \land c_3 \land c_5) \lor (c_1 \land c_4 \land c_5)$. Each conjunctive term in the DNF stands for one routing-attribute reduction result, and all the routing attributes in that term are necessary for routing selection. We obtain three reductions of routing attributes: $R_1 = \{c_1, c_2, c_5\}$, $R_2 = \{c_1, c_3, c_5\}$, and $R_3 = \{c_1, c_4, c_5\}$. The core of the attributes is $Core(A) = R_1 \cap R_2 \cap R_3 = \{c_1, c_2, c_5\} \cap \{c_1, c_3, c_5\} \cap \{c_1, c_4, c_5\} = \{c_1, c_5\}$.
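For a table this small, the reductions can also be found by exhaustive search instead of Boolean manipulation. The sketch below is a brute-force check under that assumption, not the chapter's Formula (9) procedure; `is_consistent` and `reducts` are hypothetical helper names, and a subset counts as a reduction when links that agree on it always share a rank and no smaller such subset is contained in it.

```python
from itertools import combinations

# Rows of Table 2: link -> ((c1, c2, c3, c4, c5), rank d).
TABLE2 = {
    1: ((4, 4, 1, 5, 3), "IV"),
    2: ((5, 3, 1, 8, 2), "III"),
    3: ((1, 4, 3, 5, 3), "IV"),
    4: ((1, 4, 3, 5, 1), "VI"),
    5: ((6, 2, 2, 3, 2), "II"),
    6: ((3, 4, 3, 5, 1), "IV"),
    7: ((1, 5, 1, 3, 1), "V"),
    8: ((3, 1, 0, 3, 3), "I"),
    9: ((7, 5, 4, 1, 3), "III"),
    10: ((8, 1, 1, 2, 3), "I"),
}
ATTRS = ("c1", "c2", "c3", "c4", "c5")


def is_consistent(attr_idx):
    """True if links that agree on the chosen attributes always share a rank."""
    seen = {}
    for values, rank in TABLE2.values():
        key = tuple(values[k] for k in attr_idx)
        if seen.setdefault(key, rank) != rank:
            return False
    return True


def reducts():
    """All minimal attribute subsets that keep the table consistent."""
    found = []
    for size in range(1, len(ATTRS) + 1):
        for subset in combinations(range(len(ATTRS)), size):
            if any(set(r) <= set(subset) for r in found):
                continue  # a smaller reduction is already contained in it
            if is_consistent(subset):
                found.append(subset)
    return [{ATTRS[k] for k in r} for r in found]


R = reducts()
core = set.intersection(*R)
```

The search visits all $2^M - 1$ subsets, so it is only practical for a handful of attributes such as the five used here.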

On the other hand, in the previous matrix, $|C_D(3, 4)| = 1$ and $|C_D(4, 6)| = 1$, from which we also can get $Core(A)$. This means that $c_1$ and $c_5$ are the most important routing attributes. When $C_D(i, j)$ contains $c_1$ or $c_5$, we set $C_D(i, j) = 0$.

Now a new, simpler matrix can be obtained (see Box 3).
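The two steps just described, reading the core off the single-attribute entries and then zeroing every entry that contains a core attribute, can be sketched as follows. The variable names are assumptions for illustration, and the data are the rows of Table 2.

```python
from itertools import combinations

# Rows of Table 2: link -> ((c1, c2, c3, c4, c5), rank d).
TABLE2 = {
    1: ((4, 4, 1, 5, 3), "IV"),
    2: ((5, 3, 1, 8, 2), "III"),
    3: ((1, 4, 3, 5, 3), "IV"),
    4: ((1, 4, 3, 5, 1), "VI"),
    5: ((6, 2, 2, 3, 2), "II"),
    6: ((3, 4, 3, 5, 1), "IV"),
    7: ((1, 5, 1, 3, 1), "V"),
    8: ((3, 1, 0, 3, 3), "I"),
    9: ((7, 5, 4, 1, 3), "III"),
    10: ((8, 1, 1, 2, 3), "I"),
}
ATTRS = ("c1", "c2", "c3", "c4", "c5")

# Discernibility entries for pairs of links with different ranks.
entries = {}
for i, j in combinations(sorted(TABLE2), 2):
    (vi, di), (vj, dj) = TABLE2[i], TABLE2[j]
    if di != dj:
        entries[(i, j)] = {a for a, x, y in zip(ATTRS, vi, vj) if x != y}

# Pairs that are told apart by exactly one attribute.
singletons = {pair: e for pair, e in entries.items() if len(e) == 1}

# The core collects the attributes of those single-attribute entries.
core = set().union(*singletons.values())

# Entries containing a core attribute are already discerned and are dropped
# (set to 0 in the text); whatever is left forms the simplified matrix.
simplified = {pair: e for pair, e in entries.items() if not (e & core)}
```

With these data, the only pair left in `simplified` is links 4 and 7, which share the same values of $c_1$ and $c_5$ and therefore need one more attribute to be told apart.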

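The reduced routing-attribute set can be $(c_1 \land c_2 \land c_5) \lor (c_1 \land c_3 \land c_5) \lor (c_1 \land c_4 \land c_5)$. It means that the reduction attributes can be $c_1 \land c_2 \land c_5$, $c_1 \land c_3 \land c_5$, or $c_1 \land c_4 \land c_5$. We now take only $c_1 \land c_2 \land c_5$ as an example, and then obtain Table 3.

Box 3.

$$
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
  & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
  &   & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
  &   &   & 0 & 0 & 0 & c_2 c_3 c_4 & 0 & 0 & 0 \\
  &   &   &   & 0 & 0 & 0 & 0 & 0 & 0 \\
  &   &   &   &   & 0 & 0 & 0 & 0 & 0 \\
  &   &   &   &   &   & 0 & 0 & 0 & 0 \\
  &   &   &   &   &   &   & 0 & 0 & 0 \\
  &   &   &   &   &   &   &   & 0 & 0 \\
  &   &   &   &   &   &   &   &   & 0
\end{bmatrix}
$$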

Logic Rules 

From the result of the reduction, we can obtain the associated equivalence relations $IND(P)$:

$IND(c_1) = \{\{3, 4, 7\}, \{6, 8\}, \{1\}, \{2\}, \{5\}, \{9\}, \{10\}\}$

$IND(c_2) = \{\{1, 3, 4, 6\}, \{2\}, \{5\}, \{7, 9\}, \{8, 10\}\}$

$IND(c_5) = \{\{1, 3, 8, 9, 10\}, \{2, 5\}, \{4, 6, 7\}\}$

$IND(d) = \{\{1, 3, 6\}, \{2, 9\}, \{4\}, \{5\}, \{7\}, \{8, 10\}\}$
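Each relation $IND(\cdot)$ is simply the partition of the ten links by the value of one column. The following sketch computes those partitions from the $c_1$, $c_2$, $c_5$, and $d$ columns of Table 2; `REDUCED`, `COLUMNS`, and `ind` are illustrative names, not notation from the chapter.

```python
from collections import defaultdict

# link -> (c1, c2, c5, d), taken from the corresponding columns of Table 2.
REDUCED = {
    1: (4, 4, 3, "IV"),
    2: (5, 3, 2, "III"),
    3: (1, 4, 3, "IV"),
    4: (1, 4, 1, "VI"),
    5: (6, 2, 2, "II"),
    6: (3, 4, 1, "IV"),
    7: (1, 5, 1, "V"),
    8: (3, 1, 3, "I"),
    9: (7, 5, 3, "III"),
    10: (8, 1, 3, "I"),
}
COLUMNS = {"c1": 0, "c2": 1, "c5": 2, "d": 3}


def ind(attribute):
    """Partition the links into classes that share a value of `attribute`."""
    classes = defaultdict(list)
    for link in sorted(REDUCED):
        classes[REDUCED[link][COLUMNS[attribute]]].append(link)
    return sorted(classes.values())


partitions = {name: ind(name) for name in COLUMNS}
```

Because every link has exactly one value per column, each computed partition covers all ten links exactly once.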


In Table 3, we consider the classification of each object on each routing attribute separately, and then check whether the intersection of the resulting classes is empty or not. If the intersection is not empty, the value of the attribute is the core value for this object; otherwise, there is no core value for this object, which is denoted by "-" (Starzyk, Nelson, & Sturtz, 1999). We can obtain Table 4.

We find that the initial route information in Table 2 has five attributes besides the decision attribute. In general, the router records all these attributes as historical data, so some of them are dispensable for determining the "Best Path." After removing the redundant attributes by rough set theory, all the decisions are listed in Table 5. We can derive 12 rules based on this sample.

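(1) $(c_1, 3) \land (c_2, 1) \land (c_5, 3) \rightarrow (d, \mathrm{I})$

(2) $(c_1, 8) \land (c_2, 1) \land (c_5, 3) \rightarrow (d, \mathrm{I})$

(2') $(c_2, 1) \land (c_5, 3) \rightarrow (d, \mathrm{I})$

(3) $(c_1, 6) \land (c_2, 2) \land (c_5, 2) \rightarrow (d, \mathrm{II})$

(4) $(c_1, 5) \land (c_2, 3) \land (c_5, 2) \rightarrow (d, \mathrm{III})$

(5) $(c_1, 7) \land (c_2, 5) \land (c_5, 3) \rightarrow (d, \mathrm{III})$

(6) $(c_1, 4) \land (c_2, 4) \land (c_5, 3) \rightarrow (d, \mathrm{IV})$

(7) $(c_1, 1) \land (c_2, 4) \land (c_5, 3) \rightarrow (d, \mathrm{IV})$

(7') $(c_2, 4) \land (c_5, 3) \rightarrow (d, \mathrm{IV})$

(8) $(c_1, 3) \land (c_2, 4) \land (c_5, 1) \rightarrow (d, \mathrm{IV})$

(9) $(c_1, 1) \land (c_2, 5) \land (c_5, 1) \rightarrow (d, \mathrm{V})$

(10) $(c_1, 1) \land (c_2, 4) \land (c_5, 1) \rightarrow (d, \mathrm{VI})$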
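One way to arrive at a rule set of this shape is sketched below. It emits one full rule per link on $(c_1, c_2, c_5)$, with consequents taken from the link ranks in Table 2, and then adds a shortened rule wherever several links share the same $(c_2, c_5)$ values and the same rank. That merging criterion is an assumption made for this illustration, not a procedure stated in the chapter, and `derive_rules` and `classify` are hypothetical names.

```python
from collections import defaultdict

# link -> (c1, c2, c5, d), taken from the corresponding columns of Table 2.
REDUCED = {
    1: (4, 4, 3, "IV"),
    2: (5, 3, 2, "III"),
    3: (1, 4, 3, "IV"),
    4: (1, 4, 1, "VI"),
    5: (6, 2, 2, "II"),
    6: (3, 4, 1, "IV"),
    7: (1, 5, 1, "V"),
    8: (3, 1, 3, "I"),
    9: (7, 5, 3, "III"),
    10: (8, 1, 3, "I"),
}


def derive_rules(table):
    """Return (conditions, rank) pairs: one full rule per link, plus the
    shortened rules in which c1 is dropped."""
    rules = [
        ({"c1": c1, "c2": c2, "c5": c5}, d) for c1, c2, c5, d in table.values()
    ]
    groups = defaultdict(list)
    for _c1, c2, c5, d in table.values():
        groups[(c2, c5)].append(d)
    for (c2, c5), ranks in groups.items():
        # Assumed criterion: at least two links agree on (c2, c5) and rank.
        if len(ranks) > 1 and len(set(ranks)) == 1:
            rules.append(({"c2": c2, "c5": c5}, ranks[0]))
    return rules


def classify(rules, link):
    """Rank of the first rule whose conditions all hold, or None."""
    for conditions, rank in rules:
        if all(link.get(attr) == value for attr, value in conditions.items()):
            return rank
    return None


RULES = derive_rules(REDUCED)
unseen_rank = classify(RULES, {"c1": 2, "c2": 1, "c5": 3})
```

The shortened rules let a link with a bandwidth value not seen in the sample still be ranked from its propagation delay and connection possibility alone, as in the `unseen_rank` call.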


Table 3. The decision table (columns: Available bandwidth ($c_1$), Propagation delay ($c_2$), Connection possibility ($c_5$), Link rank ($d$))

Table 4. The reduction table

| U \ C | Available bandwidth ($c_1$) | Propagation delay ($c_2$) | Connection possibility ($c_5$) | Link rank ($d$) |
|---|---|---|---|---|
| 1 | - | 4 | 3 | IV |
| 2 | 5 | 3 | 2 | III |
| 3 | - | 4 | 3 | IV |
| 4 | 1 | 4 | 1 | VI |
| 5 | 6 | 2 | 2 | II |
| 6 | 3 | 4 | 1 | IV |
| 7 | 1 | 5 | 1 | V |
| 8 | - | 1 | 3 | I |
| 9 | 7 | 5 | 3 | III |
| 10 | - | 1 | 3 | I |

Table 5. All the possible decisions table

| U \ C | Available bandwidth ($c_1$) | Propagation delay ($c_2$) | Connection possibility ($c_5$) | Link rank ($d$) |
|---|---|---|---|---|
| 1 | 4 | 4 | 3 | IV |
| 1' | - | 4 | 3 | IV |
| 2 | 5 | 3 | 2 | III |
| 3 | 1 | 4 | 3 | IV |
| 3' | - | 4 | 3 | IV |
| 4 | 1 | 4 | 1 | VI |
| 5 | 6 | 2 | 2 | II |
| 6 | 3 | 4 | 1 | IV |
| 7 | 1 | 5 | 1 | V |
| 8 | 3 | 1 | 3 | I |
| 8' | - | 1 | 3 | I |
| 9 | 7 | 5 | 3 | III |
| 10 | 8 | 1 | 3 | I |
| 10' | - | 1 | 3 | I |

The derived rules can be applied to a large amount of data in order to classify links into six ranks. Simplifying the routing attributes means that not all routing attributes need to be selected to keep the routing table consistent. That is to say, some of the link attributes need not be selected in Dijkstra's algorithm based on on-demand computation of QoS paths. For instance, for link No. 8 and link No. 10, route selection would consider only the propagation delay and connection possibility. We found that links with the same values in the retained attributes lead to the same decision as before. For a consistent data table, the best attribute set can be mined by rough set theory.

The links of the subnet are classified into six ranks. Based on the Dijkstra algorithm (see www.dataconnection.com/network/default.htm) and the mined link-rank attributes, we can find the best path from node A to node D in Figure 2 as A → E → F → D. The routing path is the best on condition that bandwidth is sufficient as well as available, and bit error ratio is acceptable.

Figure 2. The rank of link

In the context of QoS path selection, the potential benefits of the previous algorithm are even more apparent. As mentioned before, efficient selection of a best path for flows with QoS requirements usually cannot be handled using a single-attribute optimization routing criterion. While multi-attribute path selection is known to be an intractable problem, the routing algorithm based on attribute mining and rough sets handles the important attributes that reflect network resources at little additional cost in complexity. The corresponding asymptotic worst-case complexity is $O(E \log N) + O(M \cdot N + M^2)$.

CONCLUSION 

The difficulty with traditional solution schemes for QoS routing is their NP-completeness. A data mining method for QoS routing based on rough set theory has been presented in this article. Based on QoS routing concepts and rough set theory, we studied rule-mining algorithms for selecting routing attributes and reducing routing data. The case study has shown that the method is sound in selecting the best route in networks. In this article, link data are classified into different ranks with rough set theory according to QoS attributes, so that QOSPF based on QoS routing can be realized. Rough-set theory can be applied to deal with link data with given QoS attributes, and also to rapidly rank QoS attributes in terms of their significance in path selection. The proposed approach of QOSPF based on rough-set theory can select a path with QoS parameters and can offer QoS-based services in Internet routing. For further work, run-time analysis with changes in the connectivity and diameter of connection graphs should be considered. Moreover, tests in a real network environment also may be considered.